Is it possible to use the Redshift UNLOAD command in a stored procedure loop to:
UNLOAD a query that depends on a variable
Define the S3 path based on a variable
I have been experimenting with a contrived example but don't seem to be able to get it to work.
CREATE OR REPLACE PROCEDURE p_shoes()
LANGUAGE plpgsql
AS $$
DECLARE
shoe_record RECORD;
BEGIN
FOR shoe_record IN EXECUTE 'SELECT * FROM shoes' LOOP
UNLOAD('SELECT * FROM shoes JOIN shoetypes ON shoetypes.shoetype = ''' || shoe_record.shoetype || '''')
TO 's3://some-bucket/prefix/' || shoe_record.shoetype;
END LOOP;
RETURN;
END;
$$;
You can use the EXECUTE statement to run any string as a command.
So, you can build the UNLOAD command as a string (varchar), substitute the value of interest, and then EXECUTE the command.
See: Supported PL/pgSQL Statements - Amazon Redshift
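Here is a minimal sketch of that approach, using the table and bucket names from the question; the IAM role ARN is a placeholder, and quote_literal() is used to handle the nested quoting (verify that dynamic UNLOAD is permitted on your cluster):
CREATE OR REPLACE PROCEDURE p_shoes()
LANGUAGE plpgsql
AS $$
DECLARE
    shoe_record RECORD;
    inner_sql VARCHAR(MAX);
    unload_sql VARCHAR(MAX);
BEGIN
    FOR shoe_record IN EXECUTE 'SELECT DISTINCT shoetype FROM shoes' LOOP
        -- Query for this shoetype; quote_literal() quotes the value safely
        inner_sql := 'SELECT * FROM shoes WHERE shoetype = '
                     || quote_literal(shoe_record.shoetype);
        -- Wrap it in UNLOAD: the query and the S3 path are both quoted strings,
        -- so quote_literal() is used again to escape them correctly
        unload_sql := 'UNLOAD (' || quote_literal(inner_sql) || ')'
            || ' TO ' || quote_literal('s3://some-bucket/prefix/' || shoe_record.shoetype || '/')
            || ' IAM_ROLE ''arn:aws:iam::123456789012:role/MyRedshiftRole''' -- placeholder ARN
            || ' ALLOWOVERWRITE';
        EXECUTE unload_sql;
    END LOOP;
END;
$$;
Each iteration then unloads to s3://some-bucket/prefix/<shoetype>/; swap IAM_ROLE for CREDENTIALS if that is how you authenticate.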
Related
I have a file which contains holidays, and a UDF needs this file in order to calculate the number of business days between two dates. The issue I have is that when I add the file, it goes to a working directory, but this directory differs every session.
This is unlike the example below from Hive Resources:
hive> add FILE /tmp/tt.py;
hive> list FILES;
/tmp/tt.py
hive> select from networks a
MAP a.networkid
USING 'python tt.py' as nn where a.ds = '2009-01-04' limit 10;
This is what I am getting, and the alphanumeric part keeps changing:
/mnt/tmp/a17b43d5-df53-4eea-8e2c-565471b49d25_resources/holiday2021.csv
I need this file to be located in a more permanent folder, and this Hive SQL can be executed on any of the 18 nodes.
I have a job in Redshift that is responsible for pulling 6 files every month from S3. File names follow a standard naming convention such as "file_label_MonthNameYYYY_Batch01.CSV". I'd like to modify the COPY command below so the file name in the S3 path is built dynamically and I won't have to hard-code the month name, year, and batch number. Batch numbers range from 1 to 6.
Currently, here is what I have, which is not efficient:
COPY tbl_name ( column_name1, column_name2, column_name3 )
FROM 'S3://bucket_name/folder_name/Static_File_Label_July2021_Batch01.CSV'
CREDENTIALS 'aws_access_key_id = xxx;aws_secret_access_key = xxxxx'
removequotes
EMPTYASNULL
BLANKSASNULL
DATEFORMAT 'MM/DD/YYYY'
delimiter ','
IGNOREHEADER 1;
COPY tbl_name ( column_name1, column_name2, column_name3 )
FROM 'S3://bucket_name/folder_name/Static_File_Label_July2021_Batch02.CSV'
CREDENTIALS 'aws_access_key_id = xxx;aws_secret_access_key = xxxxx'
removequotes
EMPTYASNULL
BLANKSASNULL
DATEFORMAT 'MM/DD/YYYY'
delimiter ','
IGNOREHEADER 1;
Next month the file names will change to August2021_Batch01, August2021_Batch02, and so forth. Is there a way to do this? Thank you in advance.
There are lots of approaches to this. Which one is best for your case will depend on your circumstances. You need a layer in your process that controls configuring the SQL for each month. Here are some options to consider:
1. Use a manifest file. This file lists the S3 object names to load, and your processing / file-prep step can update it each month (see the sketch after this list).
2. Use a fixed load folder where the files are placed for COPY, then move them to a permanent storage location after the COPY.
3. Use variables in your bench (SQL client) to set the month value and substitute it into the SQL when it is issued to Redshift.
4. Write some code (a Lambda, perhaps?) to issue the SQL you are looking for.
5. Last I checked, you can leave the object name incomplete and all matching objects will be loaded. Leave off the batch number and suffix, and load all the files with one text change.
It is desirable to load multiple files with a single COPY command (it uses more nodes in parallel), and options 1, 2, and 5 do this.
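For option 1, a sketch of what this could look like (bucket, folder, and role names are placeholders): the file-prep step rewrites a small JSON manifest each month, and the COPY statement itself never changes.
-- Contents of s3://bucket_name/manifests/current_month.manifest, rewritten by
-- the file-prep step each month (Redshift's manifest JSON format):
-- {
--   "entries": [
--     {"url": "s3://bucket_name/folder_name/Static_File_Label_July2021_Batch01.CSV", "mandatory": true},
--     {"url": "s3://bucket_name/folder_name/Static_File_Label_July2021_Batch02.CSV", "mandatory": true}
--   ]
-- }
COPY tbl_name ( column_name1, column_name2, column_name3 )
FROM 's3://bucket_name/manifests/current_month.manifest'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole' -- placeholder; CREDENTIALS works too
MANIFEST
removequotes
EMPTYASNULL
BLANKSASNULL
DATEFORMAT 'MM/DD/YYYY'
delimiter ','
IGNOREHEADER 1;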
When specifying the FROM location of files to load, you can specify a partial filename.
Here is an example from COPY examples - Amazon Redshift:
The following example loads the SALES table with tab-delimited data from lzop-compressed files in an Amazon EMR cluster. COPY loads every file in the myoutput/ folder that begins with part-.
copy sales
from 'emr://j-SAMPLE2B500FC/myoutput/part-*'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
delimiter '\t' lzop;
Therefore, you could specify just the common key prefix (for Amazon S3 sources, COPY matches on the object key prefix, so no wildcard character is needed):
FROM 's3://bucket_name/folder_name/Static_File_Label_July2021_'
You would just need to change the month and year identifier. All files with that prefix would be loaded in one batch.
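Combining this with option 4, here is a sketch of a stored procedure that computes the month-year prefix and issues the COPY through dynamic SQL; the role ARN is a placeholder, and the TO_CHAR pattern and running COPY from EXECUTE are assumptions to verify on your cluster:
CREATE OR REPLACE PROCEDURE p_monthly_load()
LANGUAGE plpgsql
AS $$
DECLARE
    month_prefix VARCHAR(32);
    copy_sql VARCHAR(MAX);
BEGIN
    -- Build e.g. 'July2021'; FM strips the blank padding the 'Month' pattern adds
    month_prefix := to_char(current_date, 'FMMonthYYYY');
    copy_sql := 'COPY tbl_name (column_name1, column_name2, column_name3)'
        || ' FROM ''s3://bucket_name/folder_name/Static_File_Label_' || month_prefix || '_'''
        || ' IAM_ROLE ''arn:aws:iam::123456789012:role/MyRedshiftRole''' -- placeholder ARN
        || ' removequotes EMPTYASNULL BLANKSASNULL'
        || ' DATEFORMAT ''MM/DD/YYYY'' delimiter '','' IGNOREHEADER 1';
    EXECUTE copy_sql;
END;
$$;
Because the FROM value ends at the underscore before the batch number, each run picks up every batch file for that month in one COPY.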
Is there any way to provide a suffix for paths when doing a partitioned unload to S3?
e.g. if I want to use the output of several queries for batch jobs, where query outputs are partitioned by date.
Currently I have a structure in S3 like:
s3://bucket/path/queryA/key=1/ *.parquet
s3://bucket/path/queryA/key=2/ *.parquet
s3://bucket/path/queryB/key=1/ *.parquet
s3://bucket/path/queryB/key=2/ *.parquet
But ideally, I would like to have:
s3://bucket/path/key=1/queryA/ *.parquet
s3://bucket/path/key=2/queryA/ *.parquet
s3://bucket/path/key=1/queryB/ *.parquet
s3://bucket/path/key=2/queryB/ *.parquet
So that I can then use these as input paths to batch processing jobs (e.g. on SageMaker):
s3://bucket/path/key=1/
s3://bucket/path/key=2/
Such that each batch job has the output of all queries for the particular day that the batch job is computing for.
Currently, I re-shape the data in S3 after unloading, but it would be much faster and more convenient if I could specify a suffix for Redshift to append to S3 unload paths, after the partition suffix.
From the UNLOAD docs I'm assuming that this isn't possible, and I'm unable to post on AWS forums.
But perhaps there's some other command or a connection variable that I can use, a hack involving something like a literal value for a second partition key, or a totally different strategy altogether?
You could add an artificial column q to mark the query, and then use it as a second partition column - that would effectively add a q=queryA prefix to your path.
BUT, Redshift does not allow you to UNLOAD into a non-empty location unless you provide the ALLOWOVERWRITE option.
Then, since you don't control the unloaded filenames (they depend on the slice count and the maximum file size), allowing overwrite may cause your data to actually be overwritten if you happen to have the same partition keys.
To work around that, you could add one more artificial partitioning column that adds a unique component to your path (the same value for each unload). I used a random MD5 in my example for that - you could use something more clash-safe.
Below is an example query, which unloads data without overwriting results even if it is run multiple times. I ran it for different part and q values.
unload ($$
WITH
rand(rand) as (select md5(random())),
input(val, part) as (
select 1, 'p1' union all
select 1, 'p2'
)
SELECT
val,
part,
'queryB' as q,
rand as r
FROM input, rand
$$)
TO 's3://XXX/partitioned_unload/'
IAM_ROLE 'XXX'
PARTITION by (part, q, r)
ALLOWOVERWRITE
These are the files produced by 3 runs:
aws s3 ls s3://XXX/partitioned_unload/ --recursive
2020-06-29 08:29:14 2 partitioned_unload/part=p1/q=queryA/r=b43e3ff9b6b271387e2ca5424c310bb5/0001_part_00
2020-06-29 08:28:58 2 partitioned_unload/part=p1/q=queryA/r=cfcd208495d565ef66e7dff9f98764da/0001_part_00
2020-06-29 08:29:54 2 partitioned_unload/part=p1/q=queryB/r=24a4976a535a584dabdf8861548772d4/0001_part_00
2020-06-29 08:29:54 2 partitioned_unload/part=p2/q=queryB/r=24a4976a535a584dabdf8861548772d4/0001_part_00
2020-06-29 08:29:14 2 partitioned_unload/part=p3/q=queryA/r=b43e3ff9b6b271387e2ca5424c310bb5/0002_part_00
2020-06-29 08:28:58 2 partitioned_unload/part=p3/q=queryA/r=cfcd208495d565ef66e7dff9f98764da/0001_part_00
I am trying to run this simple code in SSMS connected to Azure SQL DW, and it fails. I have tried a few different variations, but none of them seems to work.
BEGIN
PRINT 'Hello ';
WAITFOR DELAY '00:00:02'
PRINT 'Another';
END
Msg 103010, Level 16, State 1, Line 47
Parse error at line: 2, column: 16: Incorrect syntax near ';'.
A bloody workaround until we have that simple built-in function:
1. Create a proc named spWait as follows:
CREATE PROC spWait @Seconds INT
AS
BEGIN
    DECLARE @BEGIN DATETIME
    DECLARE @END DATETIME
    SET @BEGIN = GETDATE()
    SET @END = DATEADD(SECOND, @Seconds, @BEGIN)
    WHILE (@BEGIN < @END)
    BEGIN
        SET @BEGIN = GETDATE()
    END
END
2. Call this between your commands:
--Do this
EXEC spWait 3
--Do that
Correct. At the moment, the WAITFOR statement isn't supported in Azure SQL DW. Note that near the top of the documentation for this statement, it says whether the statement "applies to" Azure SQL DW.
Please vote for this feature suggestion to help Microsoft prioritize this enhancement.
It may not help you much, but you can connect to the master database under a separate connection and run the WAITFOR statement.
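For example, something like this from a query window whose connection targets the master database (not the SQL DW database) should work:
-- Run on a separate connection to the master database, where WAITFOR is supported
WAITFOR DELAY '00:00:02';
PRINT 'Resumed after 2 seconds';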
I'd like to specify the current folder. I can find the current folder:
libname _dummy_ ".";
%let folder = %NRBQUOTE(%SYSFUNC(PATHNAME(_DUMMY_)));
%put &folder;
and change it manually by double-clicking the current-folder area of the status bar, but I'd prefer to code it. Is this possible?
Like this:
x 'cd <full path>';
for example
x 'cd C:\Users\foo';
SAS recognizes that a change-directory command was issued to the OS and changes its current working directory.
Be aware that the timing of an X statement is like that of other global statements (TITLE, FOOTNOTE, OPTIONS, etc.). If it is placed within a DATA step, the X statement is issued prior to the DATA step's execution.
For example, suppose your current working directory is c:\temp. The following writes HelloWorld.txt to c:\temp2 rather than c:\temp: at compile time, SAS runs the X statement, and then it performs the DATA step. Note that in SAS, a period (.) refers to the current working directory.
data _null_;
file '.\HelloWorld.txt';
put 'Hello, world!';
x 'cd C:\temp2';
run;
To change directories after the DATA step has executed, you would want to use CALL SYSTEM. Unlike the X statement, CALL routines execute when the DATA step runs rather than at compile time.
data _null_;
file '.\HelloWorld.txt';
put 'Hello, world!';
command = 'cd "C:\temp2"';
call system(command);
run;
More information about these kinds of details for Windows systems can be found in Running Windows or MS-DOS Commands from within SAS.