When do you prefer Dynamic Lookup over Static Lookup? Can we perform SCD Type 2 using a Dynamic Lookup? - Informatica

The dynamic cache refreshes when a new record gets inserted, updated, or deleted in the lookup source file. Can we perform SCD Type 1 and Type 2 using a dynamic lookup?

In order to build SCD mappings, you need to check whether the data is already in your target. You can do that by simply reading the target as a source and using a Joiner Transformation. It's possible to use a dynamic Lookup, but it is not really needed.
A dynamic Lookup is useful if you need to worry about multiple rows in your source for the same Business Key within a single mapping execution. This includes - but is not limited to - duplicates. For example, if you load invoices from your source and for whatever reason invoice 123 appears twice with different dates, like:
| Row ID | Invoice No | Invoice Date |
|--------|------------|--------------|
| 1      | 123        | 20210915     |
| 2      | 123        | 20210926     |
In this case, a Dynamic Lookup Cache makes it possible to recognize, while processing row 2, that the key has already been inserted by row 1, so row 2 can be flagged as an update. This wouldn't be possible otherwise.
A Dynamic Lookup Cache can also be used to remove duplicates in scenarios where memory usage matters: it does not need to read and cache the full set of data, as an Aggregator Transformation or Sorter Transformation would.
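To illustrate the idea only (a rough Python sketch of the cache behaviour, not Informatica syntax):

# Sketch of the dynamic-cache behaviour: the cache is refreshed as rows are
# processed, so a repeated Business Key within one run is flagged as an
# update instead of a second insert.
rows = [
    {"row_id": 1, "invoice_no": "123", "invoice_date": "20210915"},
    {"row_id": 2, "invoice_no": "123", "invoice_date": "20210926"},
]
cache = set()  # stands in for the dynamic lookup cache
for row in rows:
    if row["invoice_no"] in cache:
        row["flag"] = "update"            # row 2: key already inserted in this run
    else:
        row["flag"] = "insert"            # row 1: new key, cache updated immediately
        cache.add(row["invoice_no"])
print([(r["row_id"], r["flag"]) for r in rows])   # [(1, 'insert'), (2, 'update')]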

Is it efficient to use multiple global indexes on a single DynamoDB table?

There exists a data set as described in the table below. Sr.no is used in the table only for reference.
| sr.no | id         | tis   | data-type | b.id       | idType_2 | var_2    |
|-------|------------|-------|-----------|------------|----------|----------|
| 1     | abc-def-gi | 12345 | a-type    | 1234567890 | 843023   | NULL     |
| 2     | 1234567890 | 12346 | b-type    | NULL       | NULL     | 40030230 |
| 3     | abc-def-gj | 12347 | a-type    | 1234567890 | 843023   | NULL     |
Query types:
1. Input id and, if data-type is a-type, return fields tis, b.id, id_type2 (reference sr.no=1)
2. Input id and, if data-type is b-type, return field var_2 (reference sr.no=2)
3. Input id_type2, return fields id, tis, b.id of sr.no=1,3
4. Input data-type, return id based on tis between 12345 and 12347
Note:
- sr.no=1,3 (a-type) data is inserted 100k times a day with a unique id
- sr.no=2 (b-type) data is a fixed set of data
Is the key approach below efficient for a dataset like this? Is there any other approach that can be followed to store and retrieve this data from DynamoDB?
- Partition Key = id, to take care of Queries 1 and 2
- GSI1 = id_type2 and GSI1SK = id, to take care of Query 3
- GSI2 = data-type and GSI2SK = tis, to take care of Query 4
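In boto3 terms, the scheme I have in mind would look roughly like this (table name, attribute types, and projections are placeholders):

import boto3

dynamodb = boto3.client("dynamodb")
dynamodb.create_table(
    TableName="events",  # hypothetical table name
    AttributeDefinitions=[
        {"AttributeName": "id", "AttributeType": "S"},
        {"AttributeName": "idType_2", "AttributeType": "N"},  # assumed numeric
        {"AttributeName": "data-type", "AttributeType": "S"},
        {"AttributeName": "tis", "AttributeType": "N"},
    ],
    KeySchema=[{"AttributeName": "id", "KeyType": "HASH"}],   # queries 1 and 2
    GlobalSecondaryIndexes=[
        {   # GSI1: look up by id_type2 (query 3)
            "IndexName": "GSI1",
            "KeySchema": [
                {"AttributeName": "idType_2", "KeyType": "HASH"},
                {"AttributeName": "id", "KeyType": "RANGE"},
            ],
            "Projection": {"ProjectionType": "ALL"},
        },
        {   # GSI2: tis range per data-type (query 4)
            "IndexName": "GSI2",
            "KeySchema": [
                {"AttributeName": "data-type", "KeyType": "HASH"},
                {"AttributeName": "tis", "KeyType": "RANGE"},
            ],
            "Projection": {"ProjectionType": "ALL"},
        },
    ],
    BillingMode="PAY_PER_REQUEST",
)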
Here are my thoughts:
1) if you have data that has different access patterns, you should consider splitting the data into different tables
2) if data is accessed together, store it together - what this means is that if, whenever you read a-type data for some modeled entity, you also need to read one or more b-type records for the same entity, it is advantageous to place all these records in the same table, under the same partition key
3) data that is not accessed together does not benefit at all from being placed in the same table and in fact has the potential to become an issue in more extreme circumstances
To bring this all home: in your example, the ID for type a and type b data is different, so you get zero benefit from storing both type a and type b in the same table. Use two different tables.
The main difference between relational and non-relational databases is that non-relational stores have no cross-table joins. So whereas one of the tenets of relational databases is data normalization, the opposite tends to be the case for non-relational stores.
This was solved by doing the following inside DynamoDB, without creating any GSI.
When a GSI is created, whatever data is written to the main table is copied into the GSI, so the write cost is multiplied by the number of GSIs. With 1 GSI it's PrimaryWrite + GSIWrite; with 2 GSIs it's Primary + GSI1 + GSI2. The write into the GSI is also the same size as the primary write, so if you're writing into the primary table at 1000 WCU, the same applies to the GSI: a total of 2000 WCU for 1 GSI and 3000 WCU for 2 GSIs.
What we did:
- application_unique_id as the hash key
- timestamp as the sort key
- The rest of the keys were stored as attributes (DynamoDB supports dynamic JSON provided there is a valid hash key and a sort key).
We used a Lambda function attached to the table's DynamoDB Stream to write data into an ElasticSearch cluster.
We made a daily index of the latest snapshot data, since DynamoDB holds all the trace points and is the best place to keep and query those.
This way we knew what data was sent on which day (DynamoDB doesn't let the user export a list of hash keys), and we could do all the remaining projection and comparison queries inside ElasticSearch.
DynamoDB solved querying the time-series data at sub-millisecond latency.
ElasticSearch solved all the comparison and filter operations on top of that data.
We set the DynamoDB TTL to 30 days. ElasticSearch doesn't support TTL, but we drop each daily index once its creation day is more than 30 days old.
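In boto3 terms, the table we ended up with looks roughly like this (table name and the TTL attribute name are placeholders for illustration):

import boto3

client = boto3.client("dynamodb")
client.create_table(
    TableName="trace_points",  # hypothetical table name
    AttributeDefinitions=[
        {"AttributeName": "application_unique_id", "AttributeType": "S"},
        {"AttributeName": "timestamp", "AttributeType": "N"},
    ],
    KeySchema=[
        {"AttributeName": "application_unique_id", "KeyType": "HASH"},
        {"AttributeName": "timestamp", "KeyType": "RANGE"},
    ],
    BillingMode="PAY_PER_REQUEST",
)
client.get_waiter("table_exists").wait(TableName="trace_points")

# 30-day expiry: each item carries an epoch-seconds attribute (here
# "expires_at") set at write time to now + 30 days.
client.update_time_to_live(
    TableName="trace_points",
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "expires_at"},
)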

Optimizing hierarchical data sets for reads of whole hierarchies

I am migrating an app from Oracle to Google Spanner.
One of the cases we came across are relationships between rows in the same table.
These relationships have a tree-like structure: each row has exactly one parent, and each hierarchy has one root. Both bottom-up and top-down query patterns are possible.
There will be cases where we'd like to have efficient access to the whole record-tree. This data access pattern is latency critical.
The application previously used Oracle and their hierarchical queries (connect by) and was highly optimized for that vendor.
The number of rows in one tree fetch would range between 1 and 2000.
The table will have millions of such rows.
Rows of that table also have interleaved child-table rows within them.
Would it make much sense to optimize the table for better data locality by denormalizing the model and redundantly adding the root record's id
as the first column of the primary key of that table for faster top-down queries?
It would go like this:
| root_id | own_id | parent_id |
|---------|--------|-----------|
| 1       | 1      | 1         |
| 1       | 2      | 1         |
| 1       | 3      | 2         |
| 4       | 4      | 4         |
| 4       | 5      | 4         |
I.e., we are considering making the PK consist of (root_id, own_id) here. (The values are just illustrative; we can spread them out in the real scenario.)
What is the chance that such rows, containing the same first element of the PK, go to the same split? Would there be an actual benefit in doing so?
Cloud Spanner supports parent-child table relationships to declare a data locality relationship between two logically independent tables, and physically co-locate their rows for efficient retrieval.
Please see this link for more information: https://cloud.google.com/spanner/docs/schema-and-data-model#parent-child_table_relationships
For example, assuming we have a table 'Root' with primary key 'root_id', we can declare the table 'Own' to be a child of the 'Root' table. The primary key of the parent table becomes a prefix to the primary key of the child table. So table 'Own' could have a primary key of (root_id, own_id). All rows of table 'Own' having the same 'root_id' would be located in the same split.
Splits do have a max size limit. As a rule of thumb, the size of every set of related rows in a hierarchy of parent-child tables should be less than a few GiB.
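For illustration, a rough sketch with the Python client (instance and database names are placeholders) of declaring 'Own' as an interleaved child of 'Root', so that all 'Own' rows sharing a root_id are co-located with their 'Root' row:

from google.cloud import spanner

client = spanner.Client()
instance = client.instance("my-instance")      # assumed instance id
database = instance.database(
    "my-database",                             # assumed database id
    ddl_statements=[
        """CREATE TABLE Root (
               root_id INT64 NOT NULL
           ) PRIMARY KEY (root_id)""",
        """CREATE TABLE Own (
               root_id   INT64 NOT NULL,
               own_id    INT64 NOT NULL,
               parent_id INT64
           ) PRIMARY KEY (root_id, own_id),
             INTERLEAVE IN PARENT Root ON DELETE CASCADE""",
    ],
)
database.create().result()   # long-running operation; wait for the schema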

How to choose the primary key in DynamoDB if the item can only be uniquely identified by three or more attributes?

How to choose the primary key in DynamoDB if the item can only be uniquely identified by three or more attributes? Or this is not the correct way to use NOSQL database?
Generally, if your items are uniquely identified by three or more attributes, you can concatenate the attribute values to form a composite string key and use that as the hash key of the Dynamo table.
You can also duplicate the attributes out of the hash key into separate attributes on the item if you need to create indexes on them, or if you need to use them in condition expressions.
The rules for relational databases normal form don't necessarily apply to NoSQL databases and, in fact, a denormalized schema is usually preferred.
To expand on the concept: it is typical (and usually desirable) when designing relational database schemas to use normalized form. One of the normal forms dictates that you should not duplicate data that represents the same "thing" in your database.
I'm going to use an example that has just two parts to the key but you can extend it further.
Let's say you're designing a table that contains geographical information for the United States. In the US, a ZIP code consists of 5 digits plus an additional 4 digits that may subdivide the region.
In a relational database you might use the following schema:
Zip | Plus4 | CityName | Population
---------+-----------+---------------+---------------
CHAR(5) | CHAR(4) | NVARCHAR(100) | INTEGER
With a composite primary key of Zip, Plus4
This is perfect because the combination of Zip and Plus4 is guaranteed to be unique and you can answer any query against this table regardless of whether you have both the Zip and the additional Plus4 code, or just the Zip. And you can also get all the Plus4 codes for a Zip code rather easily.
If you wanted to store the same information in Dynamo you might create a hash key called "ZipPlus4" that is of type String and which consists of the Zip code concatenated with the Plus4 code (ie. 60210-4598) and then also store two more attributes on the item, one of which is the Zip code by itself and the other which is the Plus4 by itself. So an item in your table might have the following attributes:
ZipPlus4 | Zip | Plus4 | CityName | Population
-----------+---------+----------+-------------+---------------
String | String | String | String | Number
The ZipPlus4 above would be the Hash key for the table.
Note that in the example above you could get away with having a hash key of "Zip" and a range key of "Plus4" but as you saw, when you have more than 2 parts you need something different.
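For illustration, a rough boto3 sketch of that layout (the table name and the city/population values are placeholders):

import boto3

table = boto3.resource("dynamodb").Table("ZipCodes")   # hypothetical table with ZipPlus4 as hash key

zip_code, plus4 = "60210", "4598"
table.put_item(Item={
    "ZipPlus4": f"{zip_code}-{plus4}",   # composite hash key
    "Zip": zip_code,                     # parts duplicated as plain attributes,
    "Plus4": plus4,                      # usable for indexes or condition expressions
    "CityName": "Example City",
    "Population": 12345,
})

item = table.get_item(Key={"ZipPlus4": "60210-4598"}).get("Item")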

How to select entire intermediate table in Django

In my database I have citations between objects as a ManyToMany field. Basically, every object can cite any other object.
In Postgres, this has created an intermediate table. The table has about 12 million rows, each looks roughly like:
id | source_id | target_id
----+-----------+-----------
81 | 798429 | 767013
80 | 798429 | 102557
Two questions:
What's the most Django-tastic way to select this table?
Is there a way to iterate over this table without pulling the entire thing into memory? I'm not sure Postgres or my server will be pleased if I do a simple select * from TABLE_FOO.
The solution I found to the first question was to grab the through table and then to use values_list to get a flattened result.
So, from my example, this becomes:
through_table = AcademicPaper.papers_cited.through
all_citations = through_table.objects.values_list('source_id', 'target_id')
Doing that runs the very basic SQL that I'd expect:
print all_citations.query
SELECT 'source_id', 'target_id' FROM my_through_table;
And it returns flattened ValueList objects, which are fairly small and I can work with very easily. Even in my table with 12M objects, I was actually able to do this and put it all in memory without the server freaking out too much.
So this solved both my problems, though I think the advice in the comments about cursors looks very sound.
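For the second question (iterating without pulling everything into memory), a rough sketch building on the same through table; on recent Django versions .iterator() streams rows in chunks, and on Postgres it uses a server-side cursor (chunk_size and the handler are placeholders):

through_table = AcademicPaper.papers_cited.through

# Stream (source_id, target_id) pairs instead of materializing all 12M rows.
citations = through_table.objects.values_list('source_id', 'target_id')
for source_id, target_id in citations.iterator(chunk_size=2000):
    handle_citation(source_id, target_id)   # hypothetical per-row handler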

SQLite3 for Serialization Purposes

I've been tinkering with SQLite3 for the past couple days, and it seems like a decent database, but I'm wondering about its uses for serialization.
I need to serialize a set of key/value pairs which are linked to another table, and this is the way I've been doing this so far.
First there will be the item table:
CREATE TABLE items (id INTEGER PRIMARY KEY, details);
| id | details |
-+----+---------+
| 1 | 'test' |
-+----+---------+
| 2 | 'hello' |
-+----+---------+
| 3 | 'abc' |
-+----+---------+
Then there will a table for each item:
CREATE TABLE itemkv## (key TEXT, value); -- where ## is an 'id' field in TABLE items
| key | value |
-+-----+-------+
|'abc'|'hello'|
-+-----+-------+
|'def'|'world'|
-+-----+-------+
|'ghi'| 90001 |
-+-----+-------+
This was working okay until I noticed that there was a one kilobyte overhead for each table. If I was only dealing with a handful of items, this would be acceptable, but I need a system that can scale.
Admittedly, this is the first time I've ever used anything related to SQL, so perhaps I don't know what a table is supposed to be used for, but I couldn't find any concept of a "sub-table" or "struct" data type. Theoretically, I could convert the key/value pairs into a string like so, "abc|hello\ndef|world\nghi|90001" and store that in a column, but it makes me wonder if that defeats the purpose of using a database in the first place, if I'm going to the trouble of converting my structures to something that could be as easily stored in a flat file.
I welcome any suggestions anybody has, including suggestions of a different library better suited to serialization purposes of this type.
You might try PRAGMA page_size = 512; prior to creating the db, or prior to creating the first table, or prior to executing a VACUUM statement. (The manual is a bit contradictory and it also depends on the sqlite3 version.)
I think it's also kind of rare to create tables dynamically at a high rate. It's good that you are normalizing your schema, but it's OK for columns to depend on a primary key and, while repeating groups are a sign of lower normalization level, it's normal for foreign keys to repeat in a reasonable schema. That is, I think there is a good possibility that you need only one table of key/value pairs, with a column that identifies client instance.
Keep in mind that flat files have allocation unit overhead as well. Watch what happens when I create a one byte file:
$ cat > /tmp/one
$ ls -l /tmp/one
-rw-r--r-- 1 ross ross 1 2009-10-11 13:18 /tmp/one
$ du -h /tmp/one
4.0K /tmp/one
$
According to ls(1) it's one byte, according to du(1) it's 4K.
Don't make a table per item. That's just wrong, similar to writing a class per item in your program. Make one table for all items, or perhaps store the common parts of all items in one table, with other tables referencing it for auxiliary information. Do yourself a favor and read up on database normalization rules.
In general, the tables in your database should be fixed, in the same way that the classes in your C++ program are fixed.
Why not just store a foreign key to the items table?
Create Table ItemsVK (ID integer primary key, ItemID integer, Key Text, value Text)
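A rough sketch of that single-table layout using Python's sqlite3 module (the data values are just illustrative):

import sqlite3

conn = sqlite3.connect("items.db")
conn.executescript("""
    CREATE TABLE IF NOT EXISTS items (id INTEGER PRIMARY KEY, details TEXT);
    CREATE TABLE IF NOT EXISTS ItemsVK (
        ID     INTEGER PRIMARY KEY,
        ItemID INTEGER REFERENCES items(id),
        Key    TEXT,
        value  TEXT
    );
""")
conn.execute("INSERT INTO items (id, details) VALUES (1, 'test')")
conn.executemany(
    "INSERT INTO ItemsVK (ItemID, Key, value) VALUES (?, ?, ?)",
    [(1, 'abc', 'hello'), (1, 'def', 'world'), (1, 'ghi', '90001')],
)
conn.commit()

# All key/value pairs for item 1 come from the one shared table:
print(conn.execute("SELECT Key, value FROM ItemsVK WHERE ItemID = ?", (1,)).fetchall())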
If it's just serialization, i.e. one-shot save to disk and then one-shot restore from disk, you could use JSON (there is a list of recommended C++ libraries).
Just serialize a datastructure:
[
{'id':1,'details':'test','items':{'abc':'hello','def':'world','ghi':'90001'}},
...
]
If you want to save some bytes, you can omit the id, details, and items keys and save a list instead (in case that's a bottleneck):
[
[1,'test', {'abc':'hello','def':'world','ghi':'90001'}],
...
]
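For completeness, a round-trip sketch of that structure using Python's json module (the same shape applies to whichever C++ JSON library you pick):

import json

items = [
    {'id': 1, 'details': 'test',
     'items': {'abc': 'hello', 'def': 'world', 'ghi': '90001'}},
]

with open('items.json', 'w') as f:   # one-shot save to disk
    json.dump(items, f)

with open('items.json') as f:        # one-shot restore from disk
    restored = json.load(f)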