Informatica session refuses to update first then insert in "update then insert" mode

Very basic setup: source-to-target - I wanted to replicate MERGE behavior.
I removed the Update Strategy transformation and activated the "update then insert" rule on the target within the session. It doesn't work as described: it always attempts an insert on the primary key column, even when the same key arrives, which should have triggered an update statement. I tried the other target methods - it always attempts an insert. Attached is the mapping pic.
[mapping screenshot: basic merge attempt]

Finally figured this out. You have to make edits in three places:
a) mapping - remove the Update Strategy transformation
b) session, target properties - set the "update then insert" method
c) session properties - "Treat source rows as"
For the third one, switch "Treat source rows as" from Insert to Update.
This then allows both updates and inserts.
Why it is set up like this is beyond me, but it works.

I'll try to clarify this a bit.
First of all, using an Update Strategy transformation in the mapping requires the session's Treat source rows as property to be set to Data driven. This is the slowest possible option, as the insert/update decision is made on a row-by-row basis within the mapping - but that's exactly what you need when using the Update Strategy transformation. So in order to mirror MERGE, you need to remove it.
You also need to tell the session not to expect it in the mapping anymore, so the property has to be set to one of the remaining values. There are two options:
Set Treat source rows as to Insert - this means all rows will be inserted each time. If there are no errors (e.g. caused by a unique index), the data will be multiplied. To mimic MERGE behavior, you'd need a unique index that rejects the duplicate inserts and the target set to Insert else Update. This way, when an insert fails, an update attempt is made.
Set Treat source rows as to Update - this tells PowerCenter to try an update for each and every input row. With Update else Insert on the target, a failure (i.e. no row to update) raises no error - an insert attempt is made instead. Here there's no need for a unique index. That's one difference.
A second difference - although both solutions reproduce the MERGE operation - can be observed in performance. In an environment where new data is rare, the first approach will be slow: almost every row triggers an insert attempt that fails, only to be followed by an update, and just occasionally does the insert succeed on the first attempt. The second approach will be faster: updates succeed most of the time, and only rarely does one fail and result in an insert.
Of course, if updates are rarely expected, it's exactly the opposite.
This may look like a complex solution for a simple merge, but it also lets the developer influence the performance.
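To make the trade-off concrete, here is a rough sketch in plain Python/DB-API terms - this is not Informatica code, just an illustration of the two fallback orders, assuming a psycopg2-style connection and a made-up table t(id PRIMARY KEY, val):
def insert_else_update(conn, key, value):
    # "Treat source rows as: Insert" + target "Insert else Update":
    # attempt the insert first; fall back to an update when the key rejects it.
    cur = conn.cursor()
    try:
        cur.execute("INSERT INTO t (id, val) VALUES (%s, %s)", (key, value))
    except Exception:  # in practice: a unique/primary key violation
        conn.rollback()
        cur.execute("UPDATE t SET val = %s WHERE id = %s", (value, key))
    conn.commit()

def update_else_insert(conn, key, value):
    # "Treat source rows as: Update" + target "Update else Insert":
    # attempt the update first; insert only when no row was touched.
    cur = conn.cursor()
    cur.execute("UPDATE t SET val = %s WHERE id = %s", (value, key))
    if cur.rowcount == 0:
        cur.execute("INSERT INTO t (id, val) VALUES (%s, %s)", (key, value))
    conn.commit()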
Hope this sheds some light!

Related

Insert many items from list into SQLite

I have a list with lots of data (it will be close to 1000 items). I want to add it all to a row in one go. Is this straightforward, like a for loop over the list with multiple inserts? Multiple commits? Is this bad practice? Thanks.
I haven't tried it yet, as I'm just setting up the table columns (there are many), so I need to know whether this is feasible. Thanks.
If you're using SQL to insert:
INSERT INTO 'tablename' ('column1', 'column2') VALUES
('data1', 'data2'),
('data1', 'data2'),
('data1', 'data2'),
('data1', 'data2');
If you're using code, generate the above query with a for loop and then run it.
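If that code is Python, the standard library's sqlite3 module can do the same thing with executemany instead of building the VALUES string by hand - a minimal sketch, with the table and column names taken from the example above and made-up data:
import sqlite3

rows = [('data1', 'data2')] * 1000  # stand-in for the ~1000-item list

conn = sqlite3.connect('example.db')
conn.execute("CREATE TABLE IF NOT EXISTS tablename (column1 TEXT, column2 TEXT)")
# One parameterized statement executed for every tuple in the list -
# equivalent to the multi-row VALUES above, without string concatenation.
conn.executemany("INSERT INTO tablename (column1, column2) VALUES (?, ?)", rows)
conn.commit()
conn.close()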
For a more efficient approach consider a union as shown in: Is it possible to insert multiple rows at a time in an SQLite database?
insert into 'tablename' ('column1','column2')
select data1 as 'column1',data2 as 'column2'
union select data3,data4
union...
In SQLite you don't have network latency, so performance-wise it does not really matter if you issue many small requests to the engine. For more background you can read this page from the official documentation: https://www.sqlite.org/np1queryprob.html
But in write mode (insert or update), each individual query has to pay the cost of an implicit transaction. To avoid that, you need to gather your insert queries into one explicit transaction. How you do that varies with your programming language. Here is a code sample showing how to do it in Go; I've simplified the error handling to keep the gist visible.
tx, _ := db.Begin()
for _, item := range items {
    tx.Exec(`INSERT INTO testtable (col1, col2) VALUES (?, ?)`, item.Field1, item.Field2)
}
tx.Commit()
If you detect an error in your loop, call tx.Rollback() instead of tx.Commit() in order to cancel all previous writes to your database, so that the final state is as if no insert query had been issued at all.
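For comparison, here is the same pattern sketched with Python's sqlite3 module (table name as in the Go sample, data made up). Using the connection as a context manager commits all inserts together on success and rolls everything back if any of them raises:
import sqlite3

items = [('a', 1), ('b', 2), ('c', 3)]

conn = sqlite3.connect('example.db')
conn.execute("CREATE TABLE IF NOT EXISTS testtable (col1 TEXT, col2 INTEGER)")
conn.commit()
try:
    with conn:  # commit once at the end, or roll back everything on error
        for field1, field2 in items:
            conn.execute(
                "INSERT INTO testtable (col1, col2) VALUES (?, ?)",
                (field1, field2),
            )
finally:
    conn.close()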

DynamoDB version control using sort keys

Has anyone implemented versioning using sort keys, as described in https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-sort-keys.html?
I'm trying to implement this in TypeScript to build a database that keeps versions of its items. Is there any way of doing this using updateItem, or is a get + put operation needed?
Any sample to get me started or help is much appreciated!
The concept of versioning using the sort key involves creating a completely new item that uses the same Partition Key and a different Sort Key.
DynamoDB offers some operations that allow you to update values within an item atomically; that use case is perfect when you have something like a counter or a quantity and you want to decrease/increase it without having to read its value first (see the docs).
In the case you're trying to achieve, as mentioned, you are essentially creating a new item. DynamoDB by itself doesn't have any concept of versioning; what this pattern does is cleverly leverage the relationship between Partition Key and Sort Key - the fact that one PK can have multiple SKs associated with it - to correlate multiple rows of the same table.
To answer your question, if your only source of truth (or data store) is DynamoDB, then yes, your client will have to first query the table to know which was the last version of the item being updated and then insert the new version.
In case you are recording this information elsewhere and are using DynamoDB only to store these versions, then no, one put operation will be enough but again, this assumes you can retrieve this info somewhere else.
In terms of samples, the official documentation of the AWS SDK is always a good start, in your case I assume you'll want to use the Javascript one which you can find here.
At a very high level, you'll have to do the following:
Create an AWS.DynamoDB() client.
Execute a query using the dynamodb.query() method and specifying the PK of the item you want to update.
Go through the items (rows) returned by the previous query and find the one with the biggest version number as SK.
Put a new item using the dynamodb.putItem() method passing an item with the incremented version number as SK and same PK.
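As a rough sketch of those four steps - shown with Python and boto3 rather than the JavaScript SDK, with a hypothetical table name and made-up key names (partition key pk, numeric sort key version); steps 2-3 are folded into one call by querying in descending order with Limit=1:
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('MyVersionedTable')  # hypothetical table name

def put_new_version(pk_value, attributes):
    # Steps 2-3: fetch only the newest version for this partition key.
    resp = table.query(
        KeyConditionExpression=Key('pk').eq(pk_value),
        ScanIndexForward=False,  # descending by sort key
        Limit=1,
    )
    latest = int(resp['Items'][0]['version']) if resp['Items'] else 0
    # Step 4: put a brand-new item with the incremented sort key.
    item = {'pk': pk_value, 'version': latest + 1, **attributes}
    table.put_item(Item=item)
    return item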
You can do the technique described by Amazon with a read and then a write, or more accurately, a read and then two writes (since they want to update both v0 and a new v4!). Often, you need the extra read because you want to build the new version v4 based on data you read from v3 (or equivalently, v0) - but in case you don't need that, the read is not necessary, and two writes are enough:
You first do an UpdateItem to v0 which increments the "Latest" attribute, sets whatever attributes you want to set in the new version, and uses the ReturnValues parameter to ask the update operation to return the new "Latest" attribute.
Then you write with PutItem the new row for v4 (where 4 is the "Latest" you just read).
This approach is safe in the sense that if two clients try to create two new versions at the same time, each one will pick a different "Latest", and both will appear in the version history. It is not safe in the sense that if the client dies between step 1 and step 2, you'll have a "hole" in the version history. That said, I don't think there's any implementation of this technique that doesn't suffer from this problem.
After saying this, I want to reiterate what I said in the first paragraph: In most realistic use cases, the new version would be based on the old version, so your code anyway needs to read the old version first, then decide how to change it - and then write it (twice). You can't avoid the read in these cases. By the way, in this case the first write (to v0) would be a conditional update to verify that you only write the new version if the old version is still the same one ("Latest" is the same one you read during the read) - otherwise you'd be basing your modification on a non-current version. This is an example of optimistic locking.
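Here is a sketch of the two-write variant with boto3 (key and attribute names are assumptions: partition key pk, string sort key sk, with the v0 item holding a numeric Latest; copying the new attributes onto v0 is omitted for brevity):
import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('MyVersionedTable')  # hypothetical table name

def add_version(pk_value, attributes):
    # Write 1: atomically bump "Latest" on the v0 item and get the new value back.
    resp = table.update_item(
        Key={'pk': pk_value, 'sk': 'v0'},
        UpdateExpression='SET Latest = Latest + :one',
        ExpressionAttributeValues={':one': 1},
        ReturnValues='UPDATED_NEW',
    )
    latest = int(resp['Attributes']['Latest'])
    # Write 2: create the new version row under the sort key just reserved.
    table.put_item(Item={'pk': pk_value, 'sk': 'v{}'.format(latest), **attributes})
For the read-then-write variant described above, the first write would additionally carry a ConditionExpression comparing Latest to the value read earlier - the optimistic-locking check.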

Using column versions for time series

In the official documentation there is a passage whose reasoning I can't fully understand:
When working with time series, do not leverage the transactional behavior of rows. Changes to data in an existing row should be stored as a new, separate row, not changed in the existing row. This is an easier model to construct, and it enables you to maintain a history of activity without relying upon column versions.
The last sentence is neither obvious nor concrete, so it doesn't convince me. For now, using versioning to update the cell's data still looks to me like a good fit for the update task. At least the versions are managed by Bigtable, so it's a simpler solution.
Can anybody provide a clearer explanation of why versioning shouldn't be used in that use case?
Earlier on that page, under Patterns for row key design, a bit more detail is given. The high-level view is that using row keys instead of column versions will:
Make it easier to run queries across your data, allowing for scanning of less data.
Avoid going over the recommended maximum row size.
The one caveat being:
It is acceptable to use versions of a column where the use case is actually amending a value, and the value's history is important. For example, suppose you did a set of calculations based on the closing price of ZXZZT, and initially the data was mistakenly entered as 559.40 for the closing price instead of 558.40. In this case, it might be important to know the value's history in case the incorrect value had caused other miscalculations.
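To make the row-per-change alternative concrete, here is a minimal sketch with the Python Bigtable client - the project, instance, table, column family, and key layout are all assumptions. Each new closing price becomes its own row, keyed so that rows for one symbol sort together with the newest first, instead of piling up as versions of a single cell:
import time
from google.cloud import bigtable

client = bigtable.Client(project='my-project')
table = client.instance('my-instance').table('prices')

def record_close(symbol, price):
    # Reversed timestamp so the most recent row for a symbol sorts first.
    reverse_ts = 2**63 - int(time.time() * 1000)
    row_key = '{}#{}'.format(symbol, reverse_ts).encode()
    row = table.direct_row(row_key)
    row.set_cell('quote', b'close', str(price).encode())
    row.commit()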

What's more efficient? Read and Write If... or always write to db?

I have a database table with a column that is being updated (relatively) frequently.
The question is:
Is it more efficient to avoid always writing to the database by reading the object first (SELECT ... WHERE) and comparing the values to determine whether an update is even necessary,
or to always just issue an update (UPDATE ... WHERE) without checking the current state?
I think the first approach is more hassle, as it consists of two DB operations instead of just one, but it could avoid an unnecessary write.
I also wonder whether we should even think about this, as our DB will most likely not reach 100k records in this table anytime soon, so even if the update were more costly, it wouldn't be an issue - but please correct me if I'm wrong.
The database is PostgreSQL 9.6
It will avoid I/O load on the database if you only perform the updates that are necessary.
You can include the test in the UPDATE itself, as in:
UPDATE mytable
SET mycol = 'avalue'
WHERE id = 42
AND mycol <> 'avalue';
The only downside is that triggers will not be called unless the value really changes.
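If the caller needs to know whether the row actually changed, the statement's row count tells you. A small sketch with psycopg2 (connection string and values are made up):
import psycopg2

conn = psycopg2.connect("dbname=mydb")
with conn, conn.cursor() as cur:
    cur.execute(
        "UPDATE mytable SET mycol = %s WHERE id = %s AND mycol <> %s",
        ('avalue', 42, 'avalue'),
    )
    changed = cur.rowcount  # 0: value was already 'avalue' (or no row with id 42)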

How can I forward a primary key sequence in Django safely?

Using Django with a PostgreSQL (8.x) backend, I have a model where I need to skip a block of ids, e.g. after giving out 49999 I want the next id to be 70000, not 50000 (because that block is reserved for another source, where instances are added explicitly with an id - I know that's not a great design, but it's what I have to work with).
What is the correct/safest place for doing this?
I know I can set the sequence with
SELECT SETVAL(
(SELECT pg_get_serial_sequence('myapp_mymodel', 'id')),
70000,
false
);
but when does Django actually pull a number from the sequence?
Do I override MyModel.save(), call its super(), and then grab a cursor and check with
SELECT currval(
(SELECT pg_get_serial_sequence('myapp_mymodel', 'id'))
);
?
I believe that the sequence may be advanced by Django even if saving the model fails, so I want to make sure it gets advanced past the block whenever it hits that number - is there a better place than save()?
P.S.: Even if that were the way to go - can I actually figure out the currval for save()'s session like this? If I grab a connection and cursor and execute that second SQL statement, wouldn't I be in another session and therefore not get a currval?
Thank you for any pointers.
EDIT: I have a feeling that this must be done at database level (concurrency issues) and posted a corresponding PostgreSQL question - How can I forward a primary key sequence in PostgreSQL safely?
As I haven't found an "automated" way of doing this yet, I'm thinking of the following workaround - it would be feasible for my particular situation:
Set the sequence with a MAXVALUE 49999 NO CYCLE
When 49999 is reached, the next save() will run into a postgres error
Catch that exception and reraise as a form error "you've run out of numbers, please reset to the next block then try again"
Provide a view where the user can activate the next block, i.e. execute "ALTER SEQUENCE my_seq RESTART WITH 70000 MAXVALUE 89999"
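A rough Django-side sketch of steps 2-4 (the names RunOutOfIdsException and restart_id_sequence match the snippet further down; which exception Django actually raises when the sequence hits its MAXVALUE is an assumption you'd need to verify):
from django.db import DataError, connection

class RunOutOfIdsException(Exception):
    pass

def save_with_id_check(instance):
    # Steps 2-3: Postgres refuses nextval() once MAXVALUE is reached; assuming
    # that surfaces as a DataError, re-raise it as something form-friendly.
    try:
        instance.save()
    except DataError as e:
        if 'maximum value of sequence' in str(e):
            raise RunOutOfIdsException("you've run out of numbers, "
                                       "please activate the next block") from e
        raise

def restart_id_sequence():
    # Step 4: the view's action - move the sequence to the next reserved block.
    with connection.cursor() as cur:
        cur.execute("ALTER SEQUENCE my_seq RESTART WITH 70000 MAXVALUE 89999")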
I'm uneasy about doing the restart automatically when catching the exception:
try:
    instance.save()
except RunOutOfIdsException:
    restart_id_sequence()
    instance.save()
as I fear that two concurrent save() calls running out of ids would lead to two separate restarts and a subsequent violation of the unique constraint (basically the same concurrency problem as the original one).
My next thought was to not use a sequence for the primary key, but to always specify the id explicitly from a separate counter table, which I check/update before using its latest number - that should be safe from concurrency issues. The only problem is that although I have a single place where I add model instances, other parts of Django or third-party apps may still rely on an implicitly assigned id, which I don't want to break.
But that same mechanism happens to be easy to implement at the Postgres level - I believe this is the solution:
Don't use SERIAL for the primary key, use DEFAULT my_next_id()
Follow the same logic as for "single level gapless sequence" - http://www.varlena.com/GeneralBits/130.php - my_next_id() does an update followed by a select
Instead of just increasing by 1, check if a boundary was crossed and if so, increase even further