Redshift DEFAULT GETDATE() working on INSERT but not COPY

I have a column with a default constraint in my Redshift table so that the current timestamp will be populated for it.
CREATE TABLE test_table(
...
etl_date_time timestamp DEFAULT GETDATE(),
...
);
This works as expected on INSERTs, but I still get NULL values when copying a JSON file from S3 that has no key for this column:
COPY test_table FROM 's3://bucket/test_file.json'
CREDENTIALS '...' FORMAT AS JSON 'auto';
-- There shouldn't be any NULLs here, but there are
select count(*) from test_table where etl_date_time is null;
I have also tried putting a null value for the key in the source JSON, but that resulted in NULL values in the table as well.
{
...
"etl_date_time": null,
...
}

If the field is always NULL, consider omitting it from the files in S3 altogether. COPY lets you specify the columns you intend to load and will populate the missing ones with their DEFAULT values.
So for the file data.json:
{"col1":"r1_val1", "col3":"r1_val2"}
{"col1":"r2_val1", "col3":"r2_val2"}
And the table definition:
create table _test (
col1 varchar(20)
, col2 timestamp default getdate()
, col3 varchar(20)
);
Specific column names
The COPY command with explicit column names
copy _test(col1,col3) from 's3://bucket/data.json' format as json 'auto'
Would yield the following result:
db=# select * from _test;
  col1   |        col2         |  col3
---------+---------------------+---------
 r1_val1 | 2016-07-27 18:27:08 | r1_val2
 r2_val1 | 2016-07-27 18:27:08 | r2_val2
(2 rows)
Omitted column names
If the column names are omitted,
copy _test from 's3://bucket/data.json' format as json 'auto'
the DEFAULT would never be used and NULL would be inserted instead:
db=# select * from _test;
  col1   |        col2         |  col3
---------+---------------------+---------
 r1_val1 |                     | r1_val2
 r2_val1 |                     | r2_val2
(2 rows)
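Applied back to the question's table, the fix is to name every column except etl_date_time in the COPY. The column list below is only a placeholder, since the real table definition is elided:
-- hypothetical column list: name every real column except etl_date_time
COPY test_table(col_a, col_b, col_c)
FROM 's3://bucket/test_file.json'
CREDENTIALS '...' FORMAT AS JSON 'auto';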

Related

How to update multiple columns in the same UPDATE statement when one column depends on another column's new value in Redshift

I want to update multiple columns in the same UPDATE statement, where one column depends on another column's new value.
Example:
Sample data: col1 and col2 are the column names and test_update is the table name.
SELECT * FROM test_update;
col1 col2
col-1 col-2
col-1 col-2
col-1 col-2
update test_update set col1 = 'new', col2=col1||'-new';
SELECT * FROM test_update;
col1 col2
new col-1-new
new col-1-new
new col-1-new
What I need to achieve is for col2 to be updated to 'new-new', since the updated value of col1 is 'new'.
I think maybe it's not possible in one SQL statement. If it is possible, how can we do that? If it's not, what is the best way of handling this in a data warehouse environment, for example running multiple updates, first on col1 and then on col2, or some other approach?
Hoping my question is clear.
You cannot update the second column based on the result of updating the first column. However, this can be achieved in a single statement by "pre-calculating" the result you want and then updating based on that.
The following update using a join is based on the example provided in the Redshift documentation; it assumes test_update has a unique id column to join on:
UPDATE test_update
SET col1 = precalc.col1
  , col2 = precalc.col2
FROM (
    SELECT id
         , 'new' AS col1
         , 'new' || '-new' AS col2  -- col2 is built from the new value of col1
    FROM test_update
) precalc
WHERE test_update.id = precalc.id;

Oracle 18c - Alternative to REGEXP_REPLACE

After migrating to Oracle 18c Enterprise Edition, a function based index fails to create.
Here is my index DDL:
CREATE INDEX my_index ON my_table
(UPPER( REGEXP_REPLACE ("DEPT_NUM",'[^[:alnum:]]',NULL,1,0)))
TABLESPACE my_tbspace
PCTFREE 10
INITRANS 2
MAXTRANS 255
STORAGE (
INITIAL 64K
MINEXTENTS 1
MAXEXTENTS UNLIMITED
PCTINCREASE 0
BUFFER_POOL DEFAULT
);
I get the following error:
ORA-01743: only pure functions can be indexed
01743. 00000 - "only pure functions can be indexed"
*Cause: The indexed function uses SYSDATE or the user environment.
*Action: PL/SQL functions must be pure (RNDS, RNPS, WNDS, WNPS). SQL
expressions must not use SYSDATE, USER, USERENV(), or anything
else dependent on the session state. NLS-dependent functions
are OK.
Is this a known bug in 18c? If this function based index is no longer supported, what is another way to write this function?
The issue is that regexp_replace is not deterministic. The problem shows up when you change NLS settings:
alter session set nls_language = english;
with rws as (
select 'STÜFF' v
from dual
)
select regexp_replace ( v, '[A-Z]+', '#' )
from rws;
REGEXP_REPLACE(V,'[A-Z]+','#')
#Ü#
alter session set nls_language = german;
with rws as (
select 'STÜFF' v
from dual
)
select regexp_replace ( v, '[A-Z]+', '#' )
from rws;
REGEXP_REPLACE(V,'[A-Z]+','#')
#
U-umlaut sorts at the end of the alphabet in English, but right after U in German. So the first statement doesn't replace it; the second does.
In Oracle Database 12.1 and earlier regexp_replace was incorrectly marked as deterministic. 12.2 fixed this by making it non-deterministic.
Consider carefully whether any workarounds manage diacritics correctly.
MOS note 2592779.1 discusses this further.
Most likely the REGEXP_REPLACE causes the problem; see Find out if a string contains only ASCII characters. You can bypass the limitation with a user-defined function (thanks to Bob Jarvis):
CREATE OR REPLACE FUNCTION KEEP_ALNUM(strIn IN VARCHAR2)
RETURN VARCHAR2
DETERMINISTIC
AS
BEGIN
RETURN UPPER(REGEXP_REPLACE(strIn, '[^[:alnum:]]', NULL, 1, 0));
END KEEP_ALNUM;
/
CREATE INDEX DEPTS_1 ON DEPTS(KEEP_ALNUM(DEPT_NUM));
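Note that the index can only be picked up when queries use the same wrapper function in their predicates, along these lines (the literal is just an example value):
SELECT * FROM DEPTS WHERE KEEP_ALNUM(DEPT_NUM) = 'A123';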
Just ensure the function has the DETERMINISTIC keyword; you can then define even meaningless functions like the one below and create a function-based index on them:
CREATE OR REPLACE FUNCTION SillyValue RETURN VARCHAR2 DETERMINISTIC
AS
BEGIN
RETURN DBMS_RANDOM.STRING('p', 20);
END;
/
There are a couple of workarounds.
The first one is a hack.
As you may know, when you create an FBI, Oracle creates a hidden column and builds the index on it.
Moreover, you can even reference that hidden column by name instead of the FBI expression, and Oracle will use the index.
set lines 70 pages 70
column column_name format a15
column data_type format a15
drop table my_table;
create table my_table(dept_num, dept_descr) as select rownum||'*', 'dummy' from dual connect by level <= 1e6;
create index my_index
on my_table(upper(regexp_replace(dept_num, '[^[:alnum:]]', null, 1, 0)));
select column_name, data_type from user_tab_cols where table_name = 'MY_TABLE';
explain plan for
select * from my_table where upper(regexp_replace(dept_num, '[^[:alnum:]]', null, 1, 0)) = '666';
select * from table(dbms_xplan.display(format => 'BASIC'));
explain plan for
select * from my_table where SYS_NC00003$ = '666';
select * from table(dbms_xplan.display(format => 'BASIC'));
Output
Table dropped.
Table created.
Index created.
COLUMN_NAME DATA_TYPE
--------------- ---------------
DEPT_NUM VARCHAR2
DEPT_DESCR CHAR
SYS_NC00003$ VARCHAR2
3 rows selected.
Explain complete.
PLAN_TABLE_OUTPUT
----------------------------------------------------------------------
Plan hash value: 2234884270
--------------------------------------------------------
| Id | Operation | Name |
--------------------------------------------------------
| 0 | SELECT STATEMENT | |
| 1 | TABLE ACCESS BY INDEX ROWID BATCHED| MY_TABLE |
| 2 | INDEX RANGE SCAN | MY_INDEX |
--------------------------------------------------------
9 rows selected.
Explain complete.
PLAN_TABLE_OUTPUT
----------------------------------------------------------------------
Plan hash value: 2234884270
--------------------------------------------------------
| Id | Operation | Name |
--------------------------------------------------------
| 0 | SELECT STATEMENT | |
| 1 | TABLE ACCESS BY INDEX ROWID BATCHED| MY_TABLE |
| 2 | INDEX RANGE SCAN | MY_INDEX |
--------------------------------------------------------
9 rows selected.
So to mimic an FBI you can create a hidden column and an index on top of it.
That can be done in Oracle 11g using dbms_stats.create_extended_stats.
drop index my_index;
begin
for i in (select dbms_stats.create_extended_stats
(user, 'my_table', '(upper(regexp_replace("DEPT_NUM", ''[^[:alnum:]]'', null, 1, 0)))') as col_name
from dual)
loop
execute immediate(utl_lms.format_message('alter table %s rename column "%s" to my_hidden_col','my_table', i.col_name));
end loop;
end;
/
select column_name, data_type from user_tab_cols where table_name = 'MY_TABLE';
create index my_index on my_table(my_hidden_col);
explain plan for
select * from my_table where upper(regexp_replace(dept_num, '[^[:alnum:]]', null, 1, 0)) = '666';
select * from table(dbms_xplan.display(format => 'BASIC'));
explain plan for
select * from my_table where MY_HIDDEN_COL = '666';
select * from table(dbms_xplan.display(format => 'BASIC'));
Output
Index dropped.
PL/SQL procedure successfully completed.
COLUMN_NAME DATA_TYPE
--------------- ---------------
DEPT_NUM VARCHAR2
DEPT_DESCR CHAR
MY_HIDDEN_COL VARCHAR2
3 rows selected.
Index created.
Explain complete.
PLAN_TABLE_OUTPUT
----------------------------------------------------------------------
Plan hash value: 2234884270
--------------------------------------------------------
| Id | Operation | Name |
--------------------------------------------------------
| 0 | SELECT STATEMENT | |
| 1 | TABLE ACCESS BY INDEX ROWID BATCHED| MY_TABLE |
| 2 | INDEX RANGE SCAN | MY_INDEX |
--------------------------------------------------------
9 rows selected.
Explain complete.
PLAN_TABLE_OUTPUT
----------------------------------------------------------------------
Plan hash value: 2234884270
--------------------------------------------------------
| Id | Operation | Name |
--------------------------------------------------------
| 0 | SELECT STATEMENT | |
| 1 | TABLE ACCESS BY INDEX ROWID BATCHED| MY_TABLE |
| 2 | INDEX RANGE SCAN | MY_INDEX |
--------------------------------------------------------
9 rows selected.
Starting with Oracle 12c, invisible (hidden) columns are documented, so it becomes even more straightforward.
alter table my_table add (my_hidden_col invisible as
(upper(regexp_replace(dept_num, '[^[:alnum:]]', null, 1, 0))) virtual);
create index my_index on my_table(my_hidden_col);
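Queries can keep using the original expression; because the invisible virtual column is defined with exactly that expression, the optimizer maps the predicate onto the indexed column, just as in the 11g demonstration above:
select * from my_table
where upper(regexp_replace(dept_num, '[^[:alnum:]]', null, 1, 0)) = '666';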
Another approach is to implement the same logic without a regex.
create index my_index on my_table(
  translate(upper(dept_num),
            '_' || translate(dept_num, '_ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789', '_'),
            '_'));
But in this case you have to make sure that every predicate that uses the regex expression is rewritten to use the new expression.
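For example, a lookup would have to be phrased against the translate() expression rather than the regex (the literal '666' is just the sample value used above):
select * from my_table
where translate(upper(dept_num),
                '_' || translate(dept_num, '_ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789', '_'),
                '_') = '666';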
The work-around I found easiest was to create the index using NLS_UPPER instead of UPPER:
CREATE INDEX my_index ON my_table
(REGEXP_REPLACE(NLS_UPPER("DEPT_NUM"), '[^[:alnum:]]', NULL, 1, 0))
TABLESPACE my_tbspace
PCTFREE 10
INITRANS 2
MAXTRANS 255
STORAGE (
INITIAL 64K
MINEXTENTS 1
MAXEXTENTS UNLIMITED
PCTINCREASE 0
BUFFER_POOL DEFAULT
);

Choose DAX measure based on slicer value

Is it possible to dynamically pick up appropriate DAX measure defined in a table by slicer value?
Source table:
+----------------+------------+
| col1 | col2 |
+----------------+------------+
| selectedvalue1 | [measure1] |
| selectedvalue2 | [measure2] |
| selectedvalue3 | [measure3] |
+----------------+------------+
I put the values of col1 into a slicer. I can retrieve the selected value with:
SlicerValue = SELECTEDVALUE(tab[col1])
I could hard code:
MyVariable = SWITCH(TRUE(),
SlicerValue = "selectedvalue1" , [measure1],
SlicerValue = "selectedvalue2" , [measure2],
SlicerValue = "selectedvalue3" , [measure3],
BLANK()
)
But I do not want to hard code the relation SelectedValue vs Measure in DAX measure. I want to have it defined in the source table.
I need something like this:
MyMeasure = GETMEASURE(tab[col2])
Of course assuming that such a function exists and that only one value of col2 has been filtered.
@NickKrasnov mentioned calculation groups elsewhere. To automate the generation of your hard-coded lookup table, you could use DMVs against your pbix.
You might do something like the below to get output formatted so that it can be pasted into a large SWITCH:
SELECT
'"' + [Name] + '", [' + [Name] + '],'
FROM $SYSTEM.TMSCHEMA_MEASURES

Imputing missing values in a primary dataset based on values in a secondary dataset and a matching condition

My understanding of SAS is very elementary. I am trying to do something like this and I need help.
I have a primary dataset A with 20,000 observations, where Col1 stores the CITY and Col2 stores the MILES. Col2 contains a lot of missing data, as shown below.
+----------------+---------------+
| Col1 | Col2 |
+----------------+---------------+
| Gary,IN | 242.34 |
+----------------+---------------+
| Lafayette,OH | . |
+----------------+---------------+
| Ames, IA | 123.19 |
+----------------+---------------+
| San Jose,CA | 212.55 |
+----------------+---------------+
| Schuaumburg,IL | . |
+----------------+---------------+
| Santa Cruz,CA | 454.44 |
+----------------+---------------+
I have another, secondary dataset B with around 5,000 observations, very similar to dataset A, where Col1 stores the CITY and Col2 stores the MILES. However, in dataset B, Col2 does NOT contain missing data.
+----------------+---------------+
| Col1 | Col2 |
+----------------+---------------+
| Lafayette,OH | 321.45 |
+----------------+---------------+
| San Jose,CA | 212.55 |
+----------------+---------------+
| Schuaumburg,IL | 176.34 |
+----------------+---------------+
| Santa Cruz,CA | 454.44 |
+----------------+---------------+
My goal is to fill in the missing miles in dataset A based on the miles in dataset B by matching the city names in Col1.
In this example, I am trying to fill in 321.45 in dataset A from dataset B, and similarly 176.34, by matching Col1 (city names) between the two datasets.
I need help doing this in SAS.
You just have to merge the two datasets. Note that the values of Col1 need to match exactly in the two datasets.
Also, I am assuming that Col1 is unique in dataset B. Otherwise you need to specify more exactly which value you want to use, or remove the duplicates (for example by adding the nodupkey option to the proc sort statement).
Here is an example how to merge in SAS:
proc sort data=A;
by Col1;
proc sort data=B;
by Col1;
data AB;
merge A(in=a) B(keep=Col1 Col2 rename=(Col2 = Col2_new));
by Col1;
if a;
if missing(Col2) then Col2 = Col2_new;
drop Col2_new;
run;
This includes all observations and columns from dataset A. If Col2 is missing in A then we use the value from B.
Pekka's solution works perfectly; I am adding an alternative solution for the sake of completeness.
Sometimes in SAS a PROC SQL lets you skip a few steps compared to a DATA step (with the corresponding savings in storage and computation time), and a MERGE is a typical example.
Here you can avoid sorting both input datasets and handling the renaming of variables (here the matching key has the same name, col1, in both datasets, but in general that is not the case).
proc sql;
create table want as
select A.col1,
coalesce(A.col2,B.col2) as col2
from A left join B
on A.col1=B.col1
order by A.col1;
quit;
The coalesce() function returns the first non missing element encountered in the arguments list.

Copy data from Amazon S3 to Redshift and avoid duplicate rows

I am copying data from Amazon S3 to Redshift. During this process, I need to avoid the same files being loaded again. I don't have any unique constraints on my Redshift table. Is there a way to implement this using the copy command?
http://docs.aws.amazon.com/redshift/latest/dg/r_COPY_command_examples.html
I tried adding unique constraint and setting column as primary key with no luck. Redshift does not seem to support unique/primary key constraints.
As user1045047 mentioned, Amazon Redshift doesn't support unique constraints, so I had been looking for a way to delete duplicate records from a table with a DELETE statement.
Finally, I found a reasonable way.
Amazon Redshift supports creating an IDENTITY column, which stores an auto-generated unique number.
http://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_TABLE_NEW.html
The following SQL is written for PostgreSQL and deletes duplicated records using OID, which is a unique column there; you can use the same SQL by replacing OID with the identity column.
DELETE FROM duplicated_table WHERE OID > (
  SELECT MIN(OID) FROM duplicated_table d2
  WHERE duplicated_table.column1 = d2.column1
  AND duplicated_table.column2 = d2.column2
);
Here is an example that I tested on my Amazon Redshift cluster.
create table auto_id_table (auto_id int IDENTITY, name varchar, age int);
insert into auto_id_table (name, age) values('John', 18);
insert into auto_id_table (name, age) values('John', 18);
insert into auto_id_table (name, age) values('John', 18);
insert into auto_id_table (name, age) values('John', 18);
insert into auto_id_table (name, age) values('John', 18);
insert into auto_id_table (name, age) values('Bob', 20);
insert into auto_id_table (name, age) values('Bob', 20);
insert into auto_id_table (name, age) values('Matt', 24);
select * from auto_id_table order by auto_id;
 auto_id | name | age
---------+------+-----
       1 | John |  18
       2 | John |  18
       3 | John |  18
       4 | John |  18
       5 | John |  18
       6 | Bob  |  20
       7 | Bob  |  20
       8 | Matt |  24
(8 rows)
delete from auto_id_table where auto_id > (
select min(auto_id) from auto_id_table d
where auto_id_table.name = d.name
and auto_id_table.age = d.age
);
select * from auto_id_table order by auto_id;
 auto_id | name | age
---------+------+-----
       1 | John |  18
       6 | Bob  |  20
       8 | Matt |  24
(3 rows)
It also works with the COPY command, like this:
auto_id_table.csv
John,18
Bob,20
Matt,24
copy sql
copy auto_id_table (name, age) from '[s3-path]/auto_id_table.csv' CREDENTIALS 'aws_access_key_id=[your-aws-key-id] ;aws_secret_access_key=[your-aws-secret-key]' delimiter ',';
The advantage of this approach is that you don't need to run DDL statements. However, it doesn't work for existing tables that do not have an identity column, because an identity column cannot be added to an existing table. The only way to delete duplicated records from such tables is to migrate all the records, like this (same as user1045047's answer):
insert into temp_table (select distinct * from original_table);
drop table original_table;
alter table temp_table rename to original_table;
What about just never loading data into your master table directly?
Steps to avoid duplication:
begin transaction
bulk load into a temp staging table
delete from master table where rows = staging table rows
insert into master table from staging table (merge)
drop staging table
end transaction.
This is also reasonably fast, and it is the approach recommended by the Redshift docs; a sketch of the steps follows below.
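A minimal sketch of those steps, with placeholder table, key, and manifest names (master_table and its id column are hypothetical):
begin transaction;
-- stage the incoming files without touching the master table
create temp table staging (like master_table);
copy staging from 's3://bucket/new_files.manifest' credentials '...' manifest;
-- drop any master rows that are about to be reloaded, then merge the staged rows in
delete from master_table
using staging
where master_table.id = staging.id;
insert into master_table select * from staging;
drop table staging;
end transaction;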
My solution is to run a DELETE command on the table before the COPY. In my use case, I copy the records of a daily snapshot to a Redshift table each day, so I can use the following DELETE to ensure duplicated records are removed, and then run the COPY command.
DELETE from t_data where snapshot_day = 'xxxx-xx-xx';
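Wrapping the delete and the reload in one transaction keeps the table from ever looking partially loaded; a rough sketch, with a placeholder S3 path and the same placeholder date:
begin transaction;
DELETE from t_data where snapshot_day = 'xxxx-xx-xx';
COPY t_data FROM 's3://bucket/snapshot/' CREDENTIALS '...' FORMAT AS JSON 'auto';  -- placeholder path and format
end transaction;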
Currently there is no way to have Redshift remove duplicates for you. Redshift doesn't support primary key/unique key constraints, and removing duplicates using a row number (deleting rows with a row number greater than 1) is not an option either, as the delete operation on Redshift doesn't allow complex statements (also, the concept of a row number is not present in Redshift).
The best way to remove duplicates is to write a cron/quartz job that would select all the distinct rows, put them in a separate table and then rename the table to your original table.
Insert into temp_originalTable (Select Distinct * from originalTable);
Drop table originalTable;
Alter table temp_originalTable rename to originalTable;
There's another solution that really avoids data duplication, although it's not as straightforward as removing duplicated data after it has been inserted.
The COPY command has the manifest option to specify which files you want to copy:
copy customer
from 's3://mybucket/cust.manifest'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
manifest;
You can build a Lambda that generates a new manifest file every time, before you run the COPY command. That Lambda compares the files already copied with the new files that have arrived and creates a new manifest containing only the new files, so that you never ingest the same file twice.
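For reference, a manifest is just a JSON file listing the S3 objects to load; the paths below are placeholders:
{
  "entries": [
    {"url": "s3://mybucket/file_arrived_today_1.json", "mandatory": true},
    {"url": "s3://mybucket/file_arrived_today_2.json", "mandatory": true}
  ]
}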
We remove duplicates weekly, but you could also do this during the load transaction as mentioned by @Kyle. Also, this does require the existence of an auto-generated ID column as an eventual target of the delete:
DELETE FROM <your table> WHERE ID NOT IN (
  SELECT ID FROM (
    SELECT *, ROW_NUMBER() OVER
      ( PARTITION BY <your constraint columns> ORDER BY ID ASC ) DUPLICATES
    FROM <your table>
  ) dedup WHERE DUPLICATES = 1
); COMMIT;
for example:
CREATE TABLE IF NOT EXISTS public.requests
(
id BIGINT NOT NULL DEFAULT "identity"(1, 0, '1,1'::text) ENCODE delta
,kaid VARCHAR(50) NOT NULL
,eid VARCHAR(50) NOT NULL ENCODE text32k
,aid VARCHAR(100) NOT NULL ENCODE text32k
,sid VARCHAR(100) NOT NULL ENCODE zstd
,rid VARCHAR(100) NOT NULL ENCODE zstd
,"ts" TIMESTAMP WITHOUT TIME ZONE NOT NULL ENCODE delta32k
,rtype VARCHAR(50) NOT NULL ENCODE bytedict
,stype VARCHAR(25) ENCODE bytedict
,sver VARCHAR(50) NOT NULL ENCODE text255
,dmacd INTEGER ENCODE delta32k
,reqnum INTEGER NOT NULL ENCODE delta32k
,did VARCHAR(255) ENCODE zstd
,"region" VARCHAR(10) ENCODE lzo
)
DISTSTYLE EVEN
SORTKEY (kaid, eid, aid, "ts")
;
. . .
DELETE FROM REQUESTS WHERE ID NOT IN (
  SELECT ID FROM (
    SELECT *, ROW_NUMBER() OVER
      ( PARTITION BY DID,RID,RTYPE,TS ORDER BY ID ASC ) DUPLICATES
    FROM REQUESTS
  ) dedup WHERE DUPLICATES = 1
); COMMIT;