Why doesn't the joiner have a not-equal-to operator in Informatica?

Why doesn't Informatica's joiner transformation support !=, >=, <= operators?
Why did they come up with a concept like Lookup instead?

The Joiner transformation is used for horizontal consolidation, i.e. adding columns from a second source to each row.
e.g.
order-tbl: order-id, item-id, item-qty
item-tbl: item-id, item-price, item-desc
Using a join condition on order-tbl.item-id = item-tbl.item-id you could print a report like this:
order-id, item-id, item-price, item-desc
For this kind of consolidation, I can't think of a scenario needing other conditions like !=, >=, <=.
With the Lookup transformation, some core ETL tasks are made easy, like:
identifying whether an incoming record is a new record (its primary key doesn't exist yet) or an update to an existing record;
looking up a value, e.g. looking up item-price from item-tbl to calculate an order total (see the sketch below).
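To make the two operations concrete, here is a minimal Python sketch of the logic each transformation performs (plain in-memory rows, not Informatica itself; the column names are taken from the example above):

# Hypothetical in-memory versions of order-tbl and item-tbl
orders = [
    {"order_id": 1, "item_id": "A", "item_qty": 2},
    {"order_id": 2, "item_id": "B", "item_qty": 1},
]
items = {
    "A": {"item_price": 10.0, "item_desc": "widget"},
    "B": {"item_price": 25.0, "item_desc": "gadget"},
}

# Joiner-style equi-join: widen each order row with the matching item columns
report = [
    {"order_id": o["order_id"], "item_id": o["item_id"], **items[o["item_id"]]}
    for o in orders
    if o["item_id"] in items
]
print(report)

# Lookup-style use: fetch item_price per row to compute an order total
totals = {o["order_id"]: items[o["item_id"]]["item_price"] * o["item_qty"] for o in orders}
print(totals)  # {1: 20.0, 2: 25.0}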

You can now join heterogeneous sources with a non-equi join condition using the Lookup transformation's "Multi Match" feature.
You can download a sample from the Informatica Marketplace:
https://community.informatica.com/solutions/mapping_multi_match_lookup_join

Related

How can I use Arithmetic operators in PartiQL for DynamoDB?

According to what I have read in the docs, DynamoDB supports PartiQL, which in turn supposedly supports arithmetic operators (such as +, -, *, /).
https://docs.aws.amazon.com/qldb/latest/developerguide/ql-operators.html
When I try to perform a sum in a query, adding 1 to every element of a specific column (my_column, which is of type N - number), as in:
SELECT eventTime, my_column + 1
FROM MyTable
WHERE eventTime BETWEEN 1615161600000 AND (1615165200000)
The above statement results in an error.
What am I missing or doing wrong?
Thanks in advance.
PartiQL supports arithmetic operators, but your use case is invalid. PartiQL for DynamoDB does not have the ability to add a value to a column for each and every item in the table; as it's a NoSQL table, each item has to be updated individually.
In order to do that, you would need to Scan the table to retrieve every item, then iterate through the list of items, issuing an update command per item. A minimal boto3 sketch of the idea (the names myTable, PK, SK and AwardsWon are hypothetical):

import boto3

client = boto3.client("dynamodb")
# A PartiQL SELECT without a key condition executes as a Scan
results = client.execute_statement(Statement='SELECT PK, SK FROM "myTable"')
for item in results["Items"]:
    # Items return (and parameters pass) in DynamoDB AttributeValue format
    client.execute_statement(
        Statement='UPDATE "myTable" SET AwardsWon = AwardsWon + 1 WHERE PK = ? AND SK = ?',
        Parameters=[item["PK"], item["SK"]],
    )

Keep in mind the above is only a sketch; a large table would also need NextToken pagination. I personally would prefer using the low-level DynamoDB API; I find it easier for this kind of thing.

Can I use `contains` in a query for DynamoDB GSI?

I am new to DynamoDB. If I make a GSI, can I do a Query with KeyConditionExpression: contains(GSI, :val1) AND contains(GSI, :val2)?
Will it be a full scan?
You need to do a Scan. A Query's key condition only supports begins_with, >=, <=, >, <, =, or between; contains is not among them.
See Key Condition Expressions for Query
Not with the Query API, as @Maurice says. However, you can achieve the same "query, not scan" end result with ExecuteStatement and a PartiQL statement that applies the IN operator to the index key in question. For example, with 2 partition key values:
SELECT * from "my_table"."GSI" WHERE my_gsi_pk_key IN ['val1', 'val2']
It executes as a query operation. This answer has a complete example.
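For reference, a minimal boto3 sketch of that call (the table, index, and key names are the hypothetical ones from the statement above):

import boto3

client = boto3.client("dynamodb")
# Executes as a Query per supplied partition key value, not a Scan
response = client.execute_statement(
    Statement="SELECT * FROM \"my_table\".\"GSI\" WHERE my_gsi_pk_key IN ['val1', 'val2']"
)
print(response["Items"])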
N.B. You have no choice but to use PartiQL here, but in general I would recommend avoiding it whilst learning DynamoDB. The core API does a better job of forcing you to learn DynamoDB's idioms and unlearn RDBMS/SQL thinking.

Querying DynamoDB with a partition key and list of specific sort keys

I have a DynamoDB table that, for the sake of this question, looks like this:
id (String partition key)
origin (String sort key)
I want to query the table for a subset of origins under a specific id.
From my understanding, the only operators DynamoDB allows on sort keys in a Query are 'between', 'begins_with', '=', '<=' and '>='.
The problem is that my query needs a form of 'CONTAINS', because the origins in my list are not necessarily contiguous (so the between operator won't work).
If this was SQL it would be something like:
SELECT * from Table where id={id} AND origin IN {origin_list}
My exact question is: what do I need to do to achieve this functionality in the most efficient way? Should I change my table structure? Maybe add a GSI? Open to suggestions.
I am aware that this can be achieved with a Scan operation, but I want an efficient query. The same goes for BatchGetItem; I would rather avoid it unless absolutely necessary.
Thanks
This is a case for using Filter Expressions for Query, which support the IN operator:
Comparison Operator
a IN (b, c, d) — true if a is equal to any value in the list — for example, any of b, c or d. The list can contain up to 100 values, separated by commas.
However, you cannot use filter expressions on key attributes:
Filter Expressions for Query
A filter expression cannot contain partition key or sort key attributes. You need to specify those attributes in the key condition expression, not the filter expression.
So what you could do is keep origin out of the key (or duplicate it into a second, non-key attribute) and filter on that attribute after the query; see the sketch below. Of course, the filter first reads all the items with that id and only discards non-matching rows afterwards, which consumes read capacity and is less efficient, but there is no other way to express this in a single Query. Depending on your item sizes, query frequency, and the estimated number of returned items, BatchGetItem could be a better choice.
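A minimal boto3 sketch of that approach (the table name MyTable, the duplicated non-key attribute origin_copy, and the values are all hypothetical):

import boto3
from boto3.dynamodb.conditions import Attr, Key

table = boto3.resource("dynamodb").Table("MyTable")
# The key condition narrows the read to one partition; the filter then
# drops non-matching origins (read capacity is still consumed for them)
response = table.query(
    KeyConditionExpression=Key("id").eq("some-id"),
    FilterExpression=Attr("origin_copy").is_in(["web", "mobile", "api"]),
)
print(response["Items"])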

Redshift: Does key-based distribution optimize equality filters?

This documentation describes key distribution in Redshift as follows:
The rows are distributed according to the values in one column. The leader node will attempt to place matching values on the same node slice. If you distribute a pair of tables on the joining keys, the leader node collocates the rows on the slices according to the values in the joining columns so that matching values from the common columns are physically stored together.
I was wondering if key-distribution additionally helps in optimizing equality filters. My intuition says it should but it isn't mentioned anywhere.
Also, I saw documentation regarding sort keys which says that to select a sort key you should:
Look for columns that are used in range filters and equality filters.
This got me confused since sort-keys are explicitly mentioned as a way to optimize equality filters.
I am asking this because I already have a candidate sort-key on which I will be doing range queries. But I also want to have quick equality filters on another column which is a good distribution key in my case.
It is a very bad idea to filter on a distribution key, especially if your table / cluster is large.
The reason is that the filter may end up running on just one slice, in effect running without the benefit of MPP.
For example, if you have a dist key of added_date, all of the rows for any given date will sit together on one slice.
If the majority of your queries then filter for recent ranges of added_date, they will all be concentrated on, and will saturate, that handful of slices.
The simple rule is:
Use DISTKEY for the column most commonly joined
Use SORTKEY for the fields most commonly used in a WHERE clause
There actually are benefits to using the same field for SORTKEY and DISTKEY. From Choose the Best Sort Key:
If you frequently join a table, specify the join column as both the sort key and the distribution key.
This enables the query optimizer to choose a sort merge join instead of a slower hash join. Because the data is already sorted on the join key, the query optimizer can bypass the sort phase of the sort merge join.
Feel free to do some performance tests -- create a few different versions of the table, and use INSERT or SELECT INTO to populate them. Then, try common queries to see how they perform.
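For instance, a test along these lines (a sketch only: the connection details, source table events, and column names are hypothetical, and it assumes the psycopg2 driver):

import psycopg2

# Hypothetical connection details
conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
                        port=5439, dbname="dev", user="admin", password="...")
cur = conn.cursor()
# Two variants of the same table: one distributed and sorted on the join
# column, one distributed evenly and sorted for range filters
cur.execute("CREATE TABLE events_join_opt DISTKEY (customer_id) SORTKEY (customer_id) AS SELECT * FROM events;")
cur.execute("CREATE TABLE events_filter_opt DISTSTYLE EVEN SORTKEY (added_date) AS SELECT * FROM events;")
conn.commit()
# Inspect the plans (or time representative queries) against each variant
cur.execute("EXPLAIN SELECT * FROM events_filter_opt WHERE added_date >= '2023-01-01';")
print("\n".join(row[0] for row in cur.fetchall()))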

Why use a lookup transformation instead of just a source file with a joiner?

So I'm learning Informatica PowerCenter
(at least through the cloud designer).
I'm trying to figure out why we would use a lookup transformation to retrieve data based on a key, when we can just use a source transformation and join the data based on the key.
I did both, and they accomplished the same thing using 2 different tables (flat files, CSV).
Why would I use a lookup transformation (besides having 1 transformation instead of 2 (source + joiner))?
There are several kinds of Lookup transformation which solve particular scenarios that cannot be handled with a Joiner. For example, unconnected lookups, uncached lookups, dynamic cache lookups, and active and passive lookups each have their unique uses.
One big advantage of the Lookup transformation is the unconnected mode:
You can perform the lookup based on a condition.
You can also use the same lookup several times on several fields (e.g. you want to retrieve the names of two different customers, the payer and the ship-to, from the same dimension table).
More generally (i.e. not specific to unconnected Lookups):
You can perform a lookup on an inequality, which is not possible with the Joiner (e.g. retrieve the status of a customer at the current date, the lookup table having begin and end of validity dates; see the sketch below).
You can retrieve the first / last value based on a sort criterion if more than one record satisfies the lookup condition.
This comes in addition to what has already been said: better readability, especially in the case of multiple Lookups, Dynamic Lookup Cache, etc.
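To illustrate the inequality case in plain terms, here is a minimal Python sketch (in-memory rows with hypothetical names, not Informatica itself):

from datetime import date

# Hypothetical customer-status rows with validity date ranges
status_rows = [
    {"cust": "C1", "status": "gold", "from": date(2022, 1, 1), "to": date(2022, 12, 31)},
    {"cust": "C1", "status": "platinum", "from": date(2023, 1, 1), "to": date(9999, 12, 31)},
]

def lookup_status(cust, as_of):
    # Non-equi lookup condition: from <= as_of <= to
    for row in status_rows:
        if row["cust"] == cust and row["from"] <= as_of <= row["to"]:
            return row["status"]
    return None

print(lookup_status("C1", date.today()))  # "platinum" for any date in 2023 or later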
Hope this helps :)