Amazon Redshift: The DB is overriding created_at values with its own

I'm using a COPY command to load many files into the Redshift DB. Redshift's own created_at is overriding the created_at timestamp specified in the JSON:
COPY test
FROM 's3://test/test'
credentials 'my credentials'
json 'auto';
An example would be:
The json being imported
{"foo":"bar", "created_at":"2018-09-05 17:48:34"}
This saves successfully in the DB, but the JSON timestamp is overwritten with the current time (i.e. 2018-09-10 16:00:28).
How can I make Redshift respect the created_at times I am giving it?

Here is an excerpt from the official Redshift documentation on how COPY handles a column with a DEFAULT value:
If a column in the table is omitted from the column list, COPY will load the column with either the value supplied by the DEFAULT option that was specified in the CREATE TABLE command, or with NULL if the DEFAULT option was not specified.
So if you omit the column from the column list, COPY will always load the DEFAULT. And defaults are only evaluated once, meaning all the rows will get the same value.
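To make the quoted behavior concrete, here is a minimal sketch (the DDL is hypothetical, and it uses a CSV load since column lists apply to delimited input):
CREATE TABLE test (
    foo VARCHAR(32),
    created_at TIMESTAMP DEFAULT SYSDATE
);

-- created_at is omitted from the column list, so COPY fills it with the
-- DEFAULT (SYSDATE), evaluated once for the whole load
COPY test (foo)
FROM 's3://test/test.csv'
credentials 'my credentials'
csv;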
I believe this must not be your case; the only possible culprit could be your json 'auto', which may be unintentionally making Redshift ignore created_at.
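If that is what's happening, one way to pin the mapping down is an explicit jsonpaths file instead of 'auto'; the jsonpaths entries map JSON fields to table columns by position. A sketch, assuming a hypothetical jsonpaths file location:
The jsonpaths file, e.g. at s3://test/jsonpaths.json:
{
    "jsonpaths": [
        "$.foo",
        "$.created_at"
    ]
}
The COPY referencing it:
COPY test
FROM 's3://test/test'
credentials 'my credentials'
json 's3://test/jsonpaths.json';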
On the other hand, if you do include the DEFAULT column in the column list, COPY always loads it from your data file; records that are missing the value are loaded as NULL, and the DEFAULT logic is not applied. For example, if your data is like:
{"foo":"bar", "created_at":"2018-09-05 17:48:34"}
{"foo":"bar1","created_at":""}
{"foo":"bar2"}
{"foo":"bar3","created_at":null}
It will be populated into the database like below:
 foo  | created_at
------+---------------------
 bar2 |
 bar  | 2018-09-05 17:48:34
 bar1 |
 bar3 |
(4 rows)
So what options do you have to handle this situation?
One option: include the column in the column list and issue an UPDATE query immediately after loading your data, e.g.
update foo set created_at = sysdate where created_at is null;
Please keep in mind that UPDATEs are costly operations in Redshift, since an UPDATE is effectively a DELETE plus an INSERT. Beyond that, if possible, transform your data at the source if that's not costly there, or compare the approaches and see where populating the DEFAULT suits your case best.
I hope it helps; if not, let me know via a comment and I'll refocus the answer.


Simultaneously `CREATE TABLE LIKE` in AWS Redshift and change a few columns' default values

Workflow
In a data import workflow, we create a staging table using a CREATE TABLE LIKE statement:
CREATE TABLE abc_staging (LIKE abc INCLUDING DEFAULTS);
Then, we run COPY to import CSV data from S3 into the staging table.
The data in the CSV is incomplete: the fields partition_0, partition_1, and partition_2 are missing from the CSV file, so we fill them in like this:
UPDATE abc_staging
SET partition_0 = 'BUZINGA',
    partition_1 = '2018',
    partition_2 = '07';
Problem
This query seems expensive (it often takes ≈20 minutes), and I would like to avoid it. That would have been possible if I could configure DEFAULT values on these columns when creating the abc_staging table. I did not find any method for how that can be done, nor any explicit indication that it is impossible. So perhaps this is still possible, but I am missing how to do it?
Alternative solutions I considered
Drop these columns and add them again
That would be easy to do, but ALTER TABLE ADD COLUMN only adds columns to the end of the column list. In the abc table they are not at the end, which means the schemas of abc and abc_staging would mismatch. That breaks the ALTER TABLE APPEND operation that I use to move data from the staging table to the main table.
Note: reordering the columns in the abc table to alleviate this difficulty would require recreating the huge abc table, which I'd like to avoid.
Generate the staging table creation script programmatically with proper columns and get rid of CREATE TABLE LIKE
I will have to do that if I do not find any better solution.
Fill in the partition_* fields in the original CSV file
That is possible, but it will break backwards compatibility (I already have perhaps hundreds of thousands of files in there). Harder, but manageable.
As you are finding, you are not creating a table exactly LIKE the original, and Redshift doesn't let you ALTER a column's default value. Your proposed path is likely the best (define the staging table explicitly).
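For illustration, a sketch of that explicit definition, with hypothetical data columns col1 and col2 standing in for the real abc schema; the DEFAULTs carry the per-batch partition values, and the COPY column list omits them so they are filled in at load time:
CREATE TABLE abc_staging (
    col1        VARCHAR(100),
    col2        INTEGER,
    partition_0 VARCHAR(16) DEFAULT 'BUZINGA',
    partition_1 VARCHAR(4)  DEFAULT '2018',
    partition_2 VARCHAR(2)  DEFAULT '07'
);

-- the partition columns are omitted from the column list, so COPY
-- populates them with the DEFAULTs and no post-load UPDATE is needed
COPY abc_staging (col1, col2)
FROM 's3://bucket/prefix'
credentials 'my credentials'
csv;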
Since I don't know your exact situation, other paths might be better, so let me explore a bit. First off, when you UPDATE the staging table you are in fact reading every row in the table, invalidating that row, and writing a new row (with the new information) at the end of the table. This leads to a lot of invalidated rows. Then, when you do ALTER TABLE APPEND, all these invalidated rows are added to your main table, unless you vacuum the staging table beforehand. So you may not be getting the value you want out of ALTER TABLE APPEND.
You may be better off INSERTing the data into your main table with an ORDER BY clause. This is slower than ALTER TABLE APPEND, but you won't have to do the UPDATE, so the overall process could be faster, and you could come out further ahead because of the reduced need to VACUUM. Your situation will determine whether this is better or not. Just another option for your list.
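A sketch of that variant, again with the hypothetical col1/col2 columns and assuming col1 is abc's sort key:
INSERT INTO abc (col1, col2, partition_0, partition_1, partition_2)
SELECT col1, col2, 'BUZINGA', '2018', '07'
FROM abc_staging
ORDER BY col1;  -- insert in sort-key order to limit the VACUUM debt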
I am curious about your UPDATE speed, though. It just needs to read and then write every row in the staging table; unless the staging table is very large, it doesn't seem like this should take 20 minutes. Other activity could be creating this slowdown. Just curious.
Another option would be to change your main table to have these 3 columns last (yes, this would be some work). This way you could add the columns to the staging table and things would line up for ALTER TABLE APPEND. Just another possibility.
The easiest solution turned out to be adding the necessary partition_* fields to the source CSV files.
After employing that change and removing the UPDATE from the importer pipeline, performance has greatly improved. Imports now take ≈10 minutes each in total (that encompasses COPY, DELETE of duplicates, and ALTER TABLE APPEND).
Disk space is no longer climbing up to 100%.
Thanks everyone for the help!

Google Bigquery: Join of two external tables fails if one of them is empty

I have 2 external tables in BigQuery, created on top of JSON files on Google Cloud Storage. The first one is a fact table; the second holds error data, and it might or might not be empty.
I can query each table separately just fine, even an empty one.
I'm also able to left join them if both of them are not empty.
However, if the errors table is empty, my query fails with the following error:
The query specified one or more federated data sources but not all of them were scanned. It usually indicates incorrect uri specification or a 'limit' clause over a union of federated data sources that was satisfied without having to read all sources.
This situation isn't covered anywhere in the docs, and it's not related to this versioning issue - Reading BigQuery federated table as source in Dataflow throws an error
I'd rather avoid converting either of these tables to native, since they are used in just one step of the ETL process and the data is dropped afterwards. One of them being empty doesn't look like an exceptional situation, since a plain SELECT works just fine.
Is some workaround possible?
UPD: raised an issue with Google, waiting for response - https://issuetracker.google.com/issues/145230326
It feels like a bug. One workaround is to use scripting to avoid querying the empty table:
DECLARE is_external_table_empty BOOL DEFAULT
    (SELECT 0 = (SELECT COUNT(*) FROM your_external_table));

-- do things differently when is_external_table_empty is true
IF is_external_table_empty THEN
    ...
ELSE
    ...
END IF;
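As a fuller sketch, with hypothetical table and column names (facts, errors, id, fact_id, message), the script only references the error table when it is known to be non-empty:
DECLARE is_external_table_empty BOOL DEFAULT
    (SELECT 0 = (SELECT COUNT(*) FROM errors));

IF is_external_table_empty THEN
    -- the empty federated table is never scanned, so the error does not occur
    SELECT f.* FROM facts f;
ELSE
    SELECT f.*, e.message
    FROM facts f
    LEFT JOIN errors e ON f.id = e.fact_id;
END IF;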

Detecting delta records for nightly capture?

I have an existing HANA warehouse which was built without create/update timestamps. I need to generate a number of nightly batch delta files to send to another platform. My problem is how to detect which records are new or changed so that I can capture those records within the replication process.
Is there a way to use HANA's built-in features to detect new/changed records?
SAP HANA does not provide a general change data capture interface for tables (up to current version HANA 2 SPS 02).
That means, to detect "changed records since a given point in time" some other approach has to be taken.
Depending on the information in the tables, different options can be used:
- if a table explicitly contains a reference to the last change time, this can be used
- if a table has guaranteed update characteristics (e.g. no in-place updates and monotone ID values), this could be used, e.g. read all records where the ID is larger than the last processed ID (see the sketch after this list)
- if the table does not provide intrinsic information about the change time, then one could maintain a copy of the table that contains only the records processed so far. This copy can then be used to compare against the current table and compute the difference. SAP HANA's Smart Data Integration (SDI) flowgraphs support this approach.
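A minimal sketch of the monotone-ID option (table, column, and watermark names are hypothetical):
-- :last_processed_id is the watermark saved by the previous nightly run
SELECT *
FROM source_table
WHERE id > :last_processed_id
ORDER BY id;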
In my experience, efforts to "save time and money" on this seemingly simple problem of a delta load usually turn out to be more complex, time-consuming, and expensive than using the corresponding features of ETL tools.
It is possible to create a log table with columns organized according to your needs, so that by creating triggers on your database tables you can write a log record with timestamp values. Then you can query your log table to determine which records were inserted, updated, or deleted in your source tables.
For example, the following is one of my test triggers:
CREATE TRIGGER "A00077387"."SALARY_A_UPD"
AFTER UPDATE ON "A00077387"."SALARY"
REFERENCING OLD ROW MYOLDROW, NEW ROW MYNEWROW
FOR EACH ROW
BEGIN
    INSERT INTO SalaryLog (
        Employee,
        Salary,
        Operation,
        DateTime
    ) VALUES (
        :mynewrow.Employee,
        :mynewrow.Salary,
        'U',
        CURRENT_DATE
    );
END;
You can create AFTER INSERT and AFTER DELETE triggers as well, similar to the AFTER UPDATE one.
You can organize your log table so that you can track more than one source table if you wish, just by keeping the table name, PK fields and values, operation type, timestamp values, etc.
But it is better and easier to use a separate log table for each source table.
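With such a log in place, the nightly delta extraction reduces to a query like the following sketch (it assumes the SalaryLog layout above and a hypothetical :last_export_time watermark):
SELECT Employee, Salary, Operation, DateTime
FROM SalaryLog
WHERE DateTime > :last_export_time
ORDER BY DateTime;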

Error with upgrade codeunit when changing table's PK length

I have tables A-Z. Table A has a PK of ID, and all the other tables have fields that relate to Table A's ID.
I'm being tasked with code cleanup, and I need to change Table A's ID from length 30 to 20. I have done this for the other tables B-Z, together with the upgrade codeunit. But when I try to change Table A, I get this error:
"The are changes related to the following primary key that can cause data loss in the new table. The changes cannot be handled because the TableUpgradeMode of the TableSyncSetup type function for the changed table is set to Copy, which does not copy data to the new table. To fix this issue, you must change the TableUpgradeMode option to Move, then add C/AL code to an Upgrade type function to handle new table data."
What does the error mean? Do I need to change TableA's upgrade codeunit from TableSyncSetup.Mode::Copy to ::Move? Any guidance?
I'm using Dynamics NAV 2016.
Yes, you have to change the mode to Move, but you also have to create a new table which temporarily holds the data from the fields whose length you've reduced. You also have to handle the possible data truncation caused by the reduced field length.
But I would do this in a different way (the old way, from the Upgrade Toolkits):
- Create a new table with the same field length (30), copy the field contents over, and clear the fields (using a codeunit)
- Change the field lengths, and choose Force when NAV asks about the Sync Mode (because you know that there is no data in those fields, SQL can drop and recreate the columns)
- Using a second codeunit, copy the data back into the reduced fields, handling the truncation
I hope it helps.

What is the idiomatic way to perform a migration on a dynamo table

Suppose I have a Dynamo table named Person, which has 2 fields: name (string) and age (int). Let's assume it has a TB worth of data and experiences a small amount of read throughput, but a ton of write throughput. Now I want to add a new field called Phone (string). What is the best way to go about moving the data from one table to another?
Note: Dynamo doesn't let you rename tables, and fields cannot be null.
Here are the options I think I have:
Dump the table to .csv and run a script (overnight, probably, since it's a TB worth of data) to add a default phone number to the file. (Not ideal; this will also lose all new data submitted to the old table, unless I take the service offline to perform the migration, which is not an option in this case.)
Use the SCAN API call. (SCAN will read all values, and then inserting all the old data into the new table will consume significant write throughput.)
How can I perform a Dynamo migration on a large table without significant data loss?
You don't need to do anything. This is NoSQL, not SQL (i.e. there is no idiomatic way to do this, as you normally don't need migrations in NoSQL).
Just start writing entries with the additional key.
Records that were written before will not have this key; what you normally do is use a default value when it is missing.
If you want to backfill, just go through and read each item, then put it back with the additional field. You can do this in one run via a scan, or again do it lazily when accessing the data.
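For illustration, a hedged PartiQL sketch (statements run through DynamoDB's ExecuteStatement API; the Person table and its name key come from the question, the phone placeholder is made up): the SELECT scans for items missing the new attribute, and each UPDATE backfills one item by its primary key.
-- find items that do not have the new attribute yet (this is a full scan)
SELECT * FROM "Person" WHERE "Phone" IS MISSING;

-- backfill a single item; a PartiQL UPDATE must target the full primary key
UPDATE "Person"
SET "Phone" = '000-000-0000'
WHERE "name" = 'Alice';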