UNLOAD Redshift: append

I'd like to UNLOAD data from a Redshift table into an already existing S3 folder, similar to what happens in Spark with the "append" write mode (i.e. creating new files in the target folder if it already exists).
I'm aware of the ALLOWOVERWRITE option, but that deletes the already existing folder.
Is this supported in Redshift? If not, what approach is recommended? (It would be a desirable feature in any case, I believe...)

One solution that addresses this is to append a unique prefix to the target path, e.g.:
unload ('select * from my_table')
to 's3://mybucket/first_folder/unique_prefix_'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole';
If you add unique_prefix_ after the first folder level, all new files created during the unload operation will start with unique_prefix_, so you don't need ALLOWOVERWRITE.
The only issue with this approach is that if the unloaded data changes over time, you may end up with mixed schemas among the files in the folder.
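If you want each run to land in the same folder without colliding with earlier files, a timestamp makes a convenient unique prefix. Below is a minimal sketch of that idea, assuming a psycopg2 connection to the cluster; the host, database, bucket, and role ARN are placeholders.

# Minimal sketch: UNLOAD each run under a timestamped prefix so runs never collide.
# Assumes psycopg2; host/dbname/user/bucket/role values are placeholders.
import datetime
import psycopg2

run_prefix = datetime.datetime.utcnow().strftime("run_%Y%m%d_%H%M%S_")

unload_sql = """
    unload ('select * from my_table')
    to 's3://mybucket/first_folder/{prefix}'
    iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
""".format(prefix=run_prefix)

conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
                        port=5439, dbname="mydb", user="myuser", password="...")
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute(unload_sql)   # files land as first_folder/run_YYYYMMDD_HHMMSS_... parts
conn.close()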

Related

Create Athena table using s3 source data

Below is the S3 path where I have stored the files obtained at the end of a process. The path is dynamic, that is, the values of the following fields will vary: partner_name, customer_name, product_name.
s3://bucket/{val1}/data/{val2}/output/intermediate_results
I am trying to create Athena tables for each output file present under output/ as well as under intermediate_results/ directories, for each val1-val2.
Each file is a CSV.
But I am not very familiar with AWS Athena, so I'm unable to figure out how to implement this. I would really appreciate any kind of help. Thanks!
Use CREATE TABLE - Amazon Athena. You will need to specify the LOCATION of the data in Amazon S3 by providing a path.
Amazon Athena will automatically use all files in that path, including subdirectories. This means that a table created with a LOCATION of output/ will include all subdirectories, including intermediate_results/. Therefore, your data storage layout is not compatible with your desired use of Amazon Athena. You would need to put the data into separate paths for each table.
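For reference, here is a hedged sketch of creating one Athena table per separate path with boto3, assuming the output and intermediate_results files have first been moved to distinct prefixes; the database, table, columns, and bucket layout below are illustrative only.

# Hedged sketch: one external table per separate S3 path (not per file).
# Assumes boto3; database, table, columns and paths are illustrative placeholders.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS mydb.partner1_customer1_output (
    col1 string,
    col2 string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://bucket/partner1/data/customer1/output_files/'
TBLPROPERTIES ('skip.header.line.count' = '1')
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "mydb"},
    ResultConfiguration={"OutputLocation": "s3://bucket/athena-query-results/"},
)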

Copy ~200,000 S3 files to new prefixes

I have ~200,000 S3 files that I need to partition, and I have written an Athena query to produce a target S3 key for each of the original S3 keys. I can clearly create a script out of this, but how do I make the process robust/reliable?
I need to partition CSV files using info inside each CSV so that each file is moved to a new prefix in the same bucket. The files are mapped 1-to-1, but the new prefix depends on the data inside the file.
The copy command for each would be something like:
aws s3 cp s3://bucket/top_prefix/file.csv s3://bucket/top_prefix/var1=X/var2=Y/file.csv
I can make a single big script to copy everything using Athena and a bit of SQL, but I am concerned about doing this reliably, so that I can be sure everything is copied across and the script doesn't fail, time out, etc. Should I just run the script? From my machine, or is it better to run it on an EC2 instance first? These kinds of questions.
This is a one-off, as the application code producing the files in S3 will start outputting directly to partitions.
If each file contains data for only one partition, then you can simply move the files as you have shown. This is quite efficient because the content of the files does not need to be processed.
If, however, lines within the files each belong to different partitions, then you can use Amazon Athena to 'select' lines from an input table and output the lines to a destination table that resides in a different path, with partitioning configured. However, Athena does not "move" the files -- it simply reads them and then stores the output. If you were to do this for new data each time, you would need to use an INSERT statement to copy the new data into an existing output table, then delete the input files from S3.
Since it is one-off, and each file belongs in only one partition, I would recommend you simply "run the script". It will go slightly faster from an EC2 instance, but the data is not uploaded/downloaded -- it all stays within S3.
I often create an Excel spreadsheet with a list of input locations and output locations. I create a formula to build the aws s3 cp <input> <output_path> commands, copy them to a text file and execute it as a batch. Works fine!
You mention that the destination depends on the data inside the object, so it would probably work well as a Python script that would loop through each object, 'peek' inside the object to see where it belongs, then issue a copy_object() command to send it to the right destination. (smart-open · PyPI is a great library for reading from an S3 object without having to download it first.)
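To make that last suggestion concrete, here is a rough sketch of such a 'peek and copy' script, assuming boto3 and smart-open; the header row, column positions, and the var1/var2 prefix layout are assumptions, not part of the original question.

# Rough sketch: list objects, peek at the first data row, copy to a partitioned key.
# Assumes boto3 + smart-open; bucket, header row and column positions are assumptions.
import boto3
from smart_open import open as s3_open

s3 = boto3.client("s3")
bucket = "bucket"

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix="top_prefix/"):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if not key.endswith(".csv"):
            continue
        # Peek inside the object without downloading the whole file
        with s3_open(f"s3://{bucket}/{key}") as f:
            next(f)                                    # skip header (assumption)
            first_row = next(f).rstrip("\n").split(",")
        var1, var2 = first_row[0], first_row[1]        # assumed column positions
        filename = key.rsplit("/", 1)[-1]
        new_key = f"top_prefix/var1={var1}/var2={var2}/{filename}"
        s3.copy_object(Bucket=bucket, Key=new_key,
                       CopySource={"Bucket": bucket, "Key": key})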

Is there any way to set up an S3 bucket so that each run appends to the existing object?

We have a requirement to append to an existing S3 object when we run the Spark application every hour. I have tried this code:
df.coalesce(1).write.partitionBy("name").mode("append").option("compression", "gzip").parquet("s3n://path")
This application creates new Parquet files on every run. Hence, I am looking for a workaround to achieve this requirement.
The question is:
How can we configure the S3 bucket so that new data is appended to the existing object?
It is not possible to append to objects in Amazon S3. They can be overwritten, but not appended.
There is apparently a sneaky method where an object can be assembled via a multi-part copy, with the first 'source' part set to the existing object and a further part set to the additional data. However, that cannot be accomplished with the method you show.
If you wish to add additional data to an external table (e.g. one used by EMR or Athena), then simply add an additional file in the correct folder for the desired partition.
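For completeness, here is a hedged sketch of that multi-part copy trick with boto3. It only makes sense for formats that can be byte-concatenated (plain text/CSV, not Parquet), every part except the last must be at least 5 MB, and the result is a new object that replaces the old one rather than a true in-place append; the bucket and key are placeholders.

# Hedged sketch: "append" by rewriting the object as a multipart upload whose
# first part is a server-side copy of the existing object. Only works when the
# existing object is >= 5 MB and the format tolerates concatenation (not Parquet).
import boto3

s3 = boto3.client("s3")
bucket, key = "mybucket", "path/data.csv"          # placeholders
new_data = b"extra,rows,to,append\n"

mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
upload_id = mpu["UploadId"]

# Part 1: server-side copy of the existing object (no download needed)
part1 = s3.upload_part_copy(Bucket=bucket, Key=key, UploadId=upload_id,
                            PartNumber=1, CopySource={"Bucket": bucket, "Key": key})
# Part 2: the bytes to tack onto the end
part2 = s3.upload_part(Bucket=bucket, Key=key, UploadId=upload_id,
                       PartNumber=2, Body=new_data)

s3.complete_multipart_upload(
    Bucket=bucket, Key=key, UploadId=upload_id,
    MultipartUpload={"Parts": [
        {"PartNumber": 1, "ETag": part1["CopyPartResult"]["ETag"]},
        {"PartNumber": 2, "ETag": part2["ETag"]},
    ]},
)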

Selecting specific files for Athena

While creating a table in Athena, I am not able to create tables using specific files. Is there any way to select all the files starting with "year_2019" from a given bucket? For example:
s3://bucketname/prefix/year_2019*.csv
The documentation is very clear about this, and it is not allowed.
From:
https://docs.aws.amazon.com/athena/latest/ug/tables-location-format.html
Athena reads all files in an Amazon S3 location you specify in the CREATE TABLE statement, and cannot ignore any files included in the prefix. When you create tables, include in the Amazon S3 path only the files you want Athena to read. Use AWS Lambda functions to scan files in the source location, remove any empty files, and move unneeded files to another location.
I would like to know if the community has found a workaround :)
Unfortunately the filesystem abstraction that Athena uses for S3 doesn't support this. It requires table locations to look like directories, and Athena will add a slash to the end of the location when listing files.
There is a way to create tables that contain only a selection of files, but as far as I know it does not support wildcards, only explicit lists of files.
What you do is you create a table with
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
and then instead of pointing the LOCATION of the table to the actual files, you point it to a prefix with a single symlink.txt file (or point each partition to a prefix with a single symlink.txt). In the symlink.txt file you add the S3 URIs of the files to include in the table, one per line.
The only documentation that I know of for this feature is the S3 Inventory documentation for integrating with Athena.
You can also find a full example in this Stack Overflow answer: https://stackoverflow.com/a/55069330/1109
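As an illustration, here is a rough sketch of the symlink.txt approach driven from boto3; the file names, database, table, columns, and bucket layout are illustrative only, and the DDL follows the pattern from the S3 Inventory documentation mentioned above.

# Rough sketch: build a symlink.txt listing specific files, then point an Athena
# table at the prefix that holds it. Names and paths are illustrative placeholders.
import boto3

s3 = boto3.client("s3")
athena = boto3.client("athena", region_name="us-east-1")

# 1. One S3 URI per line, only the files you want in the table
files = [
    "s3://bucketname/prefix/year_2019_01.csv",
    "s3://bucketname/prefix/year_2019_02.csv",
]
s3.put_object(Bucket="bucketname", Key="symlinks/year_2019/symlink.txt",
              Body="\n".join(files).encode("utf-8"))

# 2. Point the table LOCATION at the prefix containing symlink.txt
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS mydb.year_2019 (
    col1 string,
    col2 string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://bucketname/symlinks/year_2019/'
"""
athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "mydb"},
    ResultConfiguration={"OutputLocation": "s3://bucketname/athena-results/"},
)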

Copying multiple tables (or entire schema) from one cluster to another

I understand that AWS doesn't support a direct copy from one cluster to another for a given table. We need to UNLOAD from one and then COPY to another. However this applies to a table. Does it apply to schema as well?
say I have a schema that looks like
some_schema
|
-- table1
-- table2
-- table3
another_schema
|
-- table4
-- table5
and I want to copy some_schema to another cluster, but don't need another_schema. Making a snapshot doesn't make sense if there are too many schemas like another_schema (say, another_schema2, another_schema3, another_schema4, etc., each with multiple tables in it).
I know I can do UNLOAD some_schema.table1 and then COPY some_schema.table1, but what can I do if I just want to copy the entire some_schema?
I believe unloading an entire schema is not available, but you have a couple of options based on the size of your cluster and the number of tables you would like to copy to the new cluster.
Create a script to generate UNLOAD and COPY commands for the schemas you would like to copy.
Create a snapshot and restore tables selectively: https://docs.aws.amazon.com/redshift/latest/mgmt/working-with-snapshots.html
If the number of tables to be excluded from the copy is not large, you can recreate them with CTAS using the BACKUP NO option, so they will not be included when you create a snapshot.
To me, option 1 looks the easiest; let me know if you need any help with that.
UPDATE:
Here is the SQL to generate the UNLOAD statements:
select 'unload (''select * from '||n.nspname||'.'||c.relname||''') to ''s3_location''
access_key_id ''accesskey''
secret_access_key ''secret_key''
delimiter ''your_delimiter''
PARALLEL ON
GZIP ;' as sql
from pg_class c
left join pg_namespace n on c.relnamespace=n.oid
where n.nspname in ('schema1','schema2');
If you would like to add an additional filter on tables, use the c.relname column.
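If you want to drive this from a script rather than running the generated SQL by hand, here is a minimal sketch assuming psycopg2 and an IAM role instead of access keys; the bucket, role ARN, and connection details are placeholders.

# Minimal sketch: generate the UNLOAD statements on the source cluster and run
# them in a loop. Assumes psycopg2; bucket, role and connection values are placeholders.
import psycopg2

generator_sql = """
select 'unload (''select * from ' || n.nspname || '.' || c.relname || ''') '
    || 'to ''s3://my-bucket/migration/' || n.nspname || '/' || c.relname || '/'' '
    || 'iam_role ''arn:aws:iam::0123456789012:role/MyRedshiftRole'' '
    || 'delimiter ''|'' gzip;' as sql
from pg_class c
left join pg_namespace n on c.relnamespace = n.oid
where n.nspname in ('some_schema') and c.relkind = 'r'
"""

src = psycopg2.connect(host="source-cluster.example.redshift.amazonaws.com",
                       port=5439, dbname="mydb", user="myuser", password="...")
src.autocommit = True
with src.cursor() as cur:
    cur.execute(generator_sql)
    for (unload_stmt,) in cur.fetchall():
        cur.execute(unload_stmt)   # one UNLOAD per table in the schema
src.close()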
I agree with the solution provided by mdem7. I would like to offer a slightly different solution that I feel may be helpful to others.
There are two problems:
Copying the schema and table definitions (meaning the DDL)
Copying the data
Here is my proposed solution.
Copying the schema and table definitions (meaning the DDL)
I think the pg_dump command suits best here; it will export the full schema definition to a SQL file that can be imported directly into another cluster.
pg_dump --schema-only -h your-host -U redshift-user -d redshift-database -p port > your-schema-file.sql
Then import the same file into the other cluster:
psql -h your-other-cluster-host -U other-cluster-username -d your-other-cluster-database-name -a -f your-schema-file.sql
Copying data
As suggested in the other answer, UNLOAD to S3 and COPY from S3 suit best here.
Hope it helps.
You really only have two options:
What mdem7 suggested, using UNLOAD/COPY.
I don't recommend using pg_dump to get the schema, as it will miss Redshift-specific table settings like DIST/SORT keys and column ENCODING.
Check out this view instead - Generate Table DDL
The alternative is what you mentioned: the restore from a snapshot (manual or automated). However, the moment the new cluster comes online (while it's still restoring), log in and drop (with CASCADE) all the schemas you do not want. This will stop the restore for the dropped schemas/tables. The only downside to this approach is that the new cluster needs to be the same size as the original, which may or may not matter. If the cluster is going to be relatively long-lived once it's restored and it makes sense, you can resize it downwards after the restore has completed.
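To make the snapshot-restore route concrete, here is a small sketch of dropping the unwanted schemas as soon as the restored cluster accepts connections, assuming psycopg2; the connection details are placeholders and the schema names come from the question above.

# Small sketch: drop unwanted schemas on the restored cluster so their tables
# are skipped by the ongoing restore. Assumes psycopg2; connection values are placeholders.
import psycopg2

conn = psycopg2.connect(host="restored-cluster.example.redshift.amazonaws.com",
                        port=5439, dbname="mydb", user="myuser", password="...")
conn.autocommit = True
with conn.cursor() as cur:
    for schema in ("another_schema", "another_schema2", "another_schema3", "another_schema4"):
        cur.execute(f"drop schema if exists {schema} cascade")
conn.close()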