Copy data from Amazon S3 to Redshift and avoid duplicate rows - amazon-web-services

I am copying data from Amazon S3 to Redshift. During this process, I need to avoid the same files being loaded again. I don't have any unique constraints on my Redshift table. Is there a way to implement this using the copy command?
http://docs.aws.amazon.com/redshift/latest/dg/r_COPY_command_examples.html
I tried adding a unique constraint and setting the column as a primary key, with no luck. Redshift does not seem to support unique/primary key constraints.

As user1045047 mentioned, Amazon Redshift doesn't support unique constraints, so I had been looking for a way to delete duplicate records from a table with a DELETE statement.
Finally, I found a reasonable way.
Amazon Redshift supports creating an IDENTITY column that stores an auto-generated unique number.
http://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_TABLE_NEW.html
The following SQL is for PostgreSQL; it deletes duplicated records using OID, which is a unique column. You can use the same SQL on Redshift by replacing OID with the identity column.
DELETE FROM duplicated_table WHERE OID > (
  SELECT MIN(OID) FROM duplicated_table d2
  WHERE duplicated_table.column1 = d2.column1
    AND duplicated_table.column2 = d2.column2
);
Here is an example that I tested on my Amazon Redshift cluster.
create table auto_id_table (auto_id int IDENTITY, name varchar, age int);
insert into auto_id_table (name, age) values('John', 18);
insert into auto_id_table (name, age) values('John', 18);
insert into auto_id_table (name, age) values('John', 18);
insert into auto_id_table (name, age) values('John', 18);
insert into auto_id_table (name, age) values('John', 18);
insert into auto_id_table (name, age) values('Bob', 20);
insert into auto_id_table (name, age) values('Bob', 20);
insert into auto_id_table (name, age) values('Matt', 24);
select * from auto_id_table order by auto_id;
auto_id | name | age
---------+------+-----
1 | John | 18
2 | John | 18
3 | John | 18
4 | John | 18
5 | John | 18
6 | Bob | 20
7 | Bob | 20
8 | Matt | 24
(8 rows)
delete from auto_id_table where auto_id > (
select min(auto_id) from auto_id_table d
where auto_id_table.name = d.name
and auto_id_table.age = d.age
);
select * from auto_id_table order by auto_id;
auto_id | name | age
---------+------+-----
1 | John | 18
6 | Bob | 20
8 | Matt | 24
(3 rows)
It also works with the COPY command, like this:
auto_id_table.csv
John,18
Bob,20
Matt,24
COPY SQL:
copy auto_id_table (name, age) from '[s3-path]/auto_id_table.csv' CREDENTIALS 'aws_access_key_id=[your-aws-key-id];aws_secret_access_key=[your-aws-secret-key]' delimiter ',';
The advantage of this approach is that you don't need to run DDL statements. However, it doesn't work for existing tables that do not have an identity column, because an identity column cannot be added to an existing table. In that case, the only way to delete duplicated records is to migrate all records into a new table, as in user1045047's answer:
insert into temp_table (select distinct * from original_table);
drop table original_table;
alter table temp_table rename to original_table;

Mmm..
What about just never loading data into your master table directly?
Steps to avoid duplication:
begin transaction
bulk load into a temp staging table
delete from master table where rows = staging table rows
insert into master table from staging table (merge)
drop staging table
end transaction.
This is also quite fast, and it's the approach recommended by the Redshift docs.
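A minimal sketch of that transaction, assuming a target table named sales with an id column; the S3 path and IAM role are illustrative:
BEGIN TRANSACTION;
-- stage the incoming file(s) into a temp table with the same layout as the target
CREATE TEMP TABLE stage (LIKE sales);
COPY stage FROM 's3://[your-bucket]/incoming/' IAM_ROLE '[your-iam-role-arn]' DELIMITER ',';
-- remove any master-table rows that are about to be re-loaded
DELETE FROM sales USING stage WHERE sales.id = stage.id;
-- insert the staged rows (the merge step)
INSERT INTO sales SELECT * FROM stage;
DROP TABLE stage;
END TRANSACTION;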

My solution is to run a DELETE command on the table before the COPY. In my use case, each run copies the records of a daily snapshot into a Redshift table, so I can use the following DELETE command to ensure that day's records are removed first, then run the COPY command.
DELETE from t_data where snapshot_day = 'xxxx-xx-xx';
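For example, one daily run could look like this (a sketch; the snapshot date, S3 path, and IAM role are illustrative):
BEGIN;
DELETE FROM t_data WHERE snapshot_day = '2020-01-15';
COPY t_data FROM 's3://[your-bucket]/snapshots/2020-01-15/' IAM_ROLE '[your-iam-role-arn]' DELIMITER ',';
END;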

Currently there is no way to remove duplicates in Redshift. Redshift doesn't support primary key/unique key constraints, and removing duplicates using a row number (deleting rows with row number greater than 1) is not an option either, as the DELETE operation on Redshift doesn't allow complex statements (and the concept of a row number is not present in Redshift).
The best way to remove duplicates is to write a cron/Quartz job that selects all the distinct rows, puts them in a separate table, and then renames that table to your original table.
Insert into temp_originalTable (Select Distinct * from originalTable);
Drop table originalTable;
Alter table temp_originalTable rename to originalTable;

There's another solution to really avoid data duplication, although it's not as straightforward as removing duplicated data once it has been inserted.
The COPY command has a manifest option to specify which files you want to copy:
copy customer
from 's3://mybucket/cust.manifest'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
manifest;
You can build a Lambda function that generates a new manifest file each time before you run the COPY command. That Lambda compares the files already copied with the newly arrived files and creates a manifest containing only the new files, so you never ingest the same file twice.
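For reference, a manifest is just a JSON file listing the S3 objects to load, so the Lambda only has to emit something like this (the file paths are illustrative):
{
  "entries": [
    {"url": "s3://mybucket/incoming/file-2020-04-01.csv", "mandatory": true},
    {"url": "s3://mybucket/incoming/file-2020-04-02.csv", "mandatory": true}
  ]
}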

We remove duplicates weekly, but you could also do this during the load transaction as mentioned by @Kyle. Also, this does require the existence of an auto-generated ID column as an eventual target of the delete:
DELETE FROM <your table> WHERE ID NOT IN (
  SELECT ID FROM (
    SELECT *, ROW_NUMBER() OVER
      ( PARTITION BY <your constraint columns> ORDER BY ID ASC ) AS DUPLICATES
    FROM <your table>
  ) ranked WHERE DUPLICATES = 1
); COMMIT;
for example:
CREATE TABLE IF NOT EXISTS public.requests
(
id BIGINT NOT NULL DEFAULT "identity"(1, 0, '1,1'::text) ENCODE delta
,kaid VARCHAR(50) NOT NULL
,eid VARCHAR(50) NOT NULL ENCODE text32k
,aid VARCHAR(100) NOT NULL ENCODE text32k
,sid VARCHAR(100) NOT NULL ENCODE zstd
,rid VARCHAR(100) NOT NULL ENCODE zstd
,"ts" TIMESTAMP WITHOUT TIME ZONE NOT NULL ENCODE delta32k
,rtype VARCHAR(50) NOT NULL ENCODE bytedict
,stype VARCHAR(25) ENCODE bytedict
,sver VARCHAR(50) NOT NULL ENCODE text255
,dmacd INTEGER ENCODE delta32k
,reqnum INTEGER NOT NULL ENCODE delta32k
,did VARCHAR(255) ENCODE zstd
,"region" VARCHAR(10) ENCODE lzo
)
DISTSTYLE EVEN
SORTKEY (kaid, eid, aid, "ts")
;
. . .
DELETE FROM REQUESTS WHERE ID NOT IN (
  SELECT ID FROM (
    SELECT *, ROW_NUMBER() OVER
      ( PARTITION BY DID,RID,RTYPE,TS ORDER BY ID ASC ) AS DUPLICATES
    FROM REQUESTS
  ) ranked WHERE DUPLICATES = 1
); COMMIT;

Related

How to avoid error "Cannot insert rows out of order" in QuestDB?

I'm trying to migrate data to QuestDB by inserting historical records. I create the table as
create table records(
type INT,
interval INT,
timestamp TIMESTAMP,
name STRING) timestamp(timestamp)
and insert data from CSV by uploading it with curl.
I get back the error "Cannot insert rows out of order". I read that out-of-order insertion is supported in QuestDB, but somehow I cannot make it work.
You can insert rows out of order into partitioned tables only. Create a new partitioned table and copy the data into it:
create table records2(
type INT,
interval INT,
timestamp TIMESTAMP,
name STRING
)
timestamp(timestamp) partition by DAY
insert into records2
select * from records
drop table records
rename table records2 to records
After this you'll be able to insert out of order into the table records.

SQLite how to limit the number of records

I want to limit the number of records in my SQLite table to, for example, 100 records, and then when I INSERT the 101st record, the first record (the oldest) should be removed from the table. In other words, I want to prevent the table from growing beyond 100 records and always keep the last 100 records. Is there any setting or query in SQLite for this, or should I handle it manually?
Thanks in advance.
You can do it with a trigger.
Say that your table is this:
CREATE TABLE tablename (
id INTEGER PRIMARY KEY,
name TEXT,
inserted_at TEXT DEFAULT (strftime('%Y-%m-%d %H:%M:%f', 'now'))
);
In the column inserted_at you will have the timestamp of the insertion of each row.
This is not necessary if you declared the column id as:
id INTEGER PRIMARY KEY AUTOINCREMENT
because in this case you could identify the 1st inserted row by the minimum value of the id.
Now create this trigger:
CREATE TRIGGER keep_100_rows AFTER INSERT ON tablename
WHEN (SELECT COUNT(*) FROM tablename) > 100
BEGIN
DELETE FROM tablename
WHERE id = (SELECT id FROM tablename ORDER BY inserted_at, id LIMIT 1);
-- or if you define id as AUTOINCREMENT
-- WHERE id = (SELECT MIN(id) FROM tablename);
END;
Every time you insert a new row, the trigger will check whether the table has more than 100 rows, and if it does, it will delete the first inserted row.
See the demo (for max 3 rows).
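As a rough self-contained check of the trigger above (a sketch: it bulk-inserts 101 rows via a recursive CTE, so the trigger should fire on the 101st row and remove the oldest one):
WITH RECURSIVE seq(n) AS (SELECT 1 UNION ALL SELECT n + 1 FROM seq WHERE n < 101)
INSERT INTO tablename (name) SELECT 'row ' || n FROM seq;
SELECT COUNT(*) FROM tablename;  -- expect 100
SELECT MIN(id) FROM tablename;   -- expect 2, since the first row was deleted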

How to select data from aws athena table which is partitioned like 'year=yyyy/month=MM/date=dd/' for a given date range?

Athena tables are partitioned following the same scheme as the S3 folder path:
parent=9ab4fcca-65d8-11ea-bc55-0242ac130003/year=2020/month=4/date=17
parent=9ab4fcca-65d8-11ea-bc55-0242ac130003/year=2020/month=4/date=9
parent=0fc966a0-bba7-4c0b-a648-cff7f0332059/year=2020/month=4/date=16
parent=9ab4fcca-65d8-11ea-bc55-0242ac130003/year=2020/month=4/date=14
PARTITIONED BY (
`parent` string,
`year` int,
`month` tinyint,
`date` tinyint)
Now how should I form the WHERE condition for a SELECT query to get data for parent = "9ab4fcca-65d8-11ea-bc55-0242ac130003" from 2019-06-01 to 2020-04-31?
SELECT *
FROM table
WHERE parent = '9ab4fcca-65d8-11ea-bc55-0242ac130003' AND year >= 2019 AND year <= 2020 AND month >= 04 AND month <= 06 AND date >= 01 AND date <= 31 ;
But this isn't correct. Please help
Partitioning on year, month, and day separately makes it unnecessarily difficult to query tables. If you're starting out I really suggest avoiding this kind of partitioning scheme. If you can't avoid it you can still make things easier by creating the table partitions differently.
Most guides will tell you to create directory structures like year=2020/month=4/date=1/file1, create a table with three corresponding partition columns, and then run MSCK REPAIR TABLE to load partitions. This works, but it's far from the best way to use Athena. MSCK REPAIR TABLE has atrocious performance, and partitioning like that is far from ideal.
I suggest creating directory structures that are just 2020-03-01/file1, but if you can't, you can actually have any structure you want: 2020/03/01/file1, year=2020/month=4/date=1/file1, or any other structure with one distinct prefix per date will work more or less equally well.
I also suggest you create tables with only one partition column: date (or dt or day if you want to avoid quoting), typed as DATE, not string.
What you do then, instead of running MSCK REPAIR TABLE, is use ALTER TABLE … ADD PARTITION or the Glue APIs directly to add partitions. This command lets you specify the location separately from the partition column value:
ALTER TABLE my_table ADD
PARTITION (day = '2020-04-01') LOCATION 's3://some-bucket/path/to/2020-04-01/'
The important thing here is that the partition column value doesn't have to have any relationship at all with the location; this would work equally well:
ALTER TABLE my_table ADD
PARTITION (day = '2020-04-01') LOCATION 's3://some-bucket/path/to/data-for-first-of-april/'
For your specific case you could have:
PARTITIONED BY (`parent` string, `day` date)
and then do:
ALTER TABLE your_table ADD
PARTITION (parent = '9ab4fcca-65d8-11ea-bc55-0242ac130003', day = '2020-04-17') LOCATION 's3://your-bucket/parent=9ab4fcca-65d8-11ea-bc55-0242ac130003/year=2020/month=4/date=17'
PARTITION (parent = '9ab4fcca-65d8-11ea-bc55-0242ac130003', day = '2020-04-09') LOCATION 's3://your-bucket/parent=9ab4fcca-65d8-11ea-bc55-0242ac130003/year=2020/month=4/date=9'
PARTITION (parent = '0fc966a0-bba7-4c0b-a648-cff7f0332059', day = '2020-04-16') LOCATION 's3://your-bucket/parent=0fc966a0-bba7-4c0b-a648-cff7f0332059/year=2020/month=4/date=16'
PARTITION (parent = '9ab4fcca-65d8-11ea-bc55-0242ac130003', day = '2020-04-14') LOCATION 's3://your-bucket/parent=9ab4fcca-65d8-11ea-bc55-0242ac130003/year=2020/month=4/date=14'
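With that layout, the date-range query from the question becomes a plain comparison on the partition column (a sketch; the table name is illustrative, and 2020-04-30 is used because April has no 31st):
SELECT *
FROM your_table
WHERE parent = '9ab4fcca-65d8-11ea-bc55-0242ac130003'
AND day BETWEEN DATE '2019-06-01' AND DATE '2020-04-30'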
Here is how you can use the year, month, and day values that come from the partitions in order to select a date range:
SELECT col1, col2
FROM my_table
WHERE CAST(date_parse(concat(CAST(year AS VARCHAR(4)), '-',
                             CAST(month AS VARCHAR(2)), '-',
                             CAST(day AS VARCHAR(2))
      ), '%Y-%m-%d') AS DATE)
      BETWEEN DATE '2019-06-01' AND DATE '2020-04-30'
You can add additional filter statements as needed.

Updating Records Via RowVersion , using 'SQL WHERE' to filter for a MAX Value

Trying to update a table based off a RowVersion value in an existing table. My data lake updates once a week, with new data stored as a .json file, which holds any new RowVersions.
I need to:
1) Query the existing table in my data warehouse to find the most up-to-date (i.e. max) RowVersion
2) Use that value to only filter/select the records in my data warehouse that are greater than the RowVersion I just identified
3) Update my table to include the new rows
My question is: in the SQL below, I am not sure how to select the max RowVersion in the current table and then use that to filter/specify what I want returned when querying my S3 bucket:
create or replace temporary table UPDATE_CAR_SALES AS
SELECT
VALUE:CAR::string AS CARS,
VALUE:RowVersion::INT AS ROW_VERSION
having row_version > max(row_version)
from '#s3_bucket',
lateral flatten( input => $1:value);
It's not clear to me how you store the data. Is the CARS column unique? Do you need to find the maximum row version for each car, or for all cars/rows? Anyway, you can use a sub-query to filter the rows whose row version is higher than the max value:
create or replace temporary table UPDATE_CAR_SALES AS
SELECT
VALUE:CAR::string AS CARS,
VALUE:RowVersion::INT AS ROW_VERSION
FROM #s3_bucket, lateral flatten( input => $1 )
where ROW_VERSION > (SELECT MAX(RowVersion)
from MAIN_TABLE);
If you need to filter the rows based on the row version of each car (in the existing table):
create or replace temporary table UPDATE_CAR_SALES AS
SELECT * FROM (SELECT
VALUE:CAR::string AS CARS,
VALUE:RowVersion::INT AS ROW_VERSION
FROM #s3_bucket, lateral flatten( input => $1 )) temp_table
where temp_table.ROW_VERSION > (SELECT MAX(RowVersion)
from MAIN_TABLE where cars = temp_table.CARS );
I needed to put the main query in parentheses to be able to use the alias. Hope it helps.

Alter column data type in Amazon Redshift

How do I alter a column's data type in an Amazon Redshift database?
I am not able to alter the column data type in Redshift; is there any way to modify the data type in Amazon Redshift?
As noted in the ALTER TABLE documentation, you can change the length of VARCHAR columns using
ALTER TABLE table_name
{
ALTER COLUMN column_name TYPE new_data_type
}
For other column types, all I can think of is to add a new column with the correct data type, then copy all the data from the old column to the new one, and finally drop the old column.
Use code similar to this:
ALTER TABLE t1 ADD COLUMN new_column ___correct_column_type___;
UPDATE t1 SET new_column = column;
ALTER TABLE t1 DROP COLUMN column;
ALTER TABLE t1 RENAME COLUMN new_column TO column;
There will be a schema change: the newly added column will be last in the table (that may be a problem with the COPY statement; keep in mind that you can define a column order with COPY).
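For example, if the new column ended up last but your files still use the old layout, you can list the columns explicitly in the COPY so the fields map correctly (a sketch; the table, columns, S3 path, and IAM role are illustrative):
COPY t1 (col_a, col_b, new_column)
FROM 's3://[your-bucket]/data/'
IAM_ROLE '[your-iam-role-arn]'
DELIMITER ',';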
To avoid the schema change mentioned by Tomasz:
BEGIN TRANSACTION;
ALTER TABLE <TABLE_NAME> RENAME TO <TABLE_NAME>_OLD;
CREATE TABLE <TABLE_NAME> ( <NEW_COLUMN_DEFINITION> );
INSERT INTO <TABLE_NAME> (<NEW_COLUMN_DEFINITION>)
SELECT <COLUMNS>
FROM <TABLE_NAME>_OLD;
DROP TABLE <TABLE_NAME>_OLD;
END TRANSACTION;
(Recent update) It's possible to alter the type for varchar columns in Redshift.
ALTER COLUMN column_name TYPE new_data_type
Example:
CREATE TABLE t1 (c1 varchar(100))
ALTER TABLE t1 ALTER COLUMN c1 TYPE varchar(200)
See the ALTER TABLE documentation for details.
If you don't want to change the column order, an option is to create a temp table, drop and recreate the original one with the desired size, and then bulk-load the data back in.
CREATE TEMP TABLE temp_table AS SELECT * FROM original_table;
DROP TABLE original_table;
CREATE TABLE original_table ...
INSERT INTO original_table SELECT * FROM temp_table;
The only problem with recreating the table is that you will need to grant the permissions again, and if the table is very big it will take some time.
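For example, grants like the following would need to be re-applied after the rebuild (the privileges and group name are illustrative):
GRANT SELECT, INSERT ON original_table TO GROUP reporting_users;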
ALTER TABLE publisher_catalogs ADD COLUMN new_version integer;
update publisher_catalogs set new_version = CAST(version AS integer);
ALTER TABLE publisher_catalogs DROP COLUMN version RESTRICT;
ALTER TABLE publisher_catalogs RENAME new_version to version;
Redshift, being a columnar database, doesn't allow you to modify the data type directly;
however, below is one approach. Note that it will change the column order.
Steps:
1. Alter the table to add a new column.
2. Update the new column's value with the old column's value.
3. Alter the table to drop the old column.
4. Alter the table to rename the new column to the old column's name.
If you don't want to alter the order of the columns, the solution would be to:
1. Create a temp table with the new column definition.
2. Copy the data from the old table to the new table.
3. Drop the old table.
4. Rename the new table to the old table's name.
One important thing: create the new table using the LIKE clause instead of a simple CREATE.
This method works for converting a (big)int column into a varchar:
-- Create a backup of the original table
create table original_table_backup as select * from original_table;
-- Drop the original table, and then recreate with new desired data types
drop table original_table;
create table original_table (
col1 bigint,
col2 varchar(20) -- changed from bigint
);
-- insert original entries back into the new table
insert into original_table select * from original_table_backup;
-- cleanup
drop table original_table_backup;
You can use the statements below:
ALTER TABLE <table_name>        -- e.g. etl_proj_atm.dim_card_type
ALTER COLUMN <column_name>      -- e.g. card_type
TYPE varchar(30);
UNLOAD and COPY with a table-rename strategy should be the most efficient way to do this operation if retaining the table structure (row order) is important.
Here is an example, building on the transaction approach in the answer above.
BEGIN TRANSACTION;
ALTER TABLE <TABLE_NAME> RENAME TO <TABLE_NAME>_OLD;
CREATE TABLE <TABLE_NAME> ( <NEW_COLUMN_DEFINITION> );
UNLOAD ('select * from <TABLE_NAME>_OLD') TO 's3://bucket/key/unload_' manifest;
COPY <TABLE_NAME> FROM 's3://bucket/key/unload_manifest' manifest;
END TRANSACTION;
For updating values in the same column in Redshift, this works fine:
UPDATE table_name
SET column_name = 'new_value' WHERE column_name = 'old_value'
You can have multiple clauses in the WHERE by using AND, so as to remove any ambiguity for SQL.
cheers!!