Joiner transformation query - Informatica

I have
Flat File1 (F1) with these columns - key1, col1, col2
Flat File2 (F2) with these columns - key2, col1, col2
and one table (T1) with these columns - key3, col1, col2
The requirement is to get data from all 3 sources based on the checks below:
when key1 in flat file F1 matches key2 in flat file F2 - return all matching rows from F1 and F2
when key1 in flat file F1 does not match key2 in flat file F2 - only then should a check be done between flat file F1 and table T1 on the condition key1 = key3, and if a match is found - return all matching rows from T1 and F1
To achieve the above task,
I created a Joiner transformation between these 2 sources - F1 (Master) and F2 (Detail) - and got the matching rows; the join type I selected was "Detail Outer Join".
I am stuck on how to do the remaining checks.
Can anyone please guide?

You can follow the steps below.
First join FF1 and FF2 (outer join on FF2 so all data from FF1 comes in).
Then use a Router to separate the data that doesn't exist in FF2. You can send the matching records to the target (group 1).
Non-matching records can be picked up where ff1.key is not null but ff2.key2 is null. Pick those records and match them with table T1 using another Joiner (JNR).
You can send these matching records to the target as well.
The whole mapping should look like this -
sq_FF1 (Master) --\
                   JNR (ff1.key = ff2.key2, Detail Outer Join) --> ROUTER (2 groups)
sq_FF2 (Detail) --/   |-- Grp 1: ff1.key and ff2.key2 both NOT NULL (matching) ----------------------------> TGT
                      |-- Grp 2: ff1.key NOT NULL and ff2.key2 IS NULL (non-matching) --\
                                                                                          JNR (key1 = key3, inner join) --> TGT
sq_T1 -----------------------------------------------------------------------------------/
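For intuition, the end result of this mapping is equivalent to the following SQL (a sketch only, treating the flat files as if they were tables; the mapping itself implements it with the Joiner/Router flow above):
-- matching rows between F1 and F2
select f1.key1, f1.col1, f1.col2, f2.col1, f2.col2
from F1 f1
join F2 f2 on f1.key1 = f2.key2
union all
-- F1 rows with no match in F2, matched against T1 instead
select f1.key1, f1.col1, f1.col2, t1.col1, t1.col2
from F1 f1
join T1 t1 on f1.key1 = t1.key3
where not exists (select 1 from F2 f2 where f2.key2 = f1.key1);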

Can't we bring the resultant outcome of both sets of data into one common transformation (like a Union) -> and implement the common logic from there?
i.e.
return all matching rows in F1 and F2
the remaining unmatched rows of F1 should be joined with table T1
Finally, the resultant outcome of the above 2 sets should be routed to one common transformation (like a Union) -> and from there we have one common logic.
I have used a Joiner transformation to bring in the matching rows of F1 and F2 ->
used a Filter transformation with the condition key2 IS NULL to identify all unmatched rows of F1 ->
used a Joiner transformation to link table T1 with the records that were identified by the Filter ->
The results identified in step 1 and step 3 are routed to a Union.
But there is an issue when we merge data using the Union transformation: because we are bringing data in based on the join type "Detail Outer Join", the data seems to get duplicated. How do I get rid of this issue?


I want to assign 'Y' to the duplicate records and 'N' to the unique records, and display those 'Y' and 'N' flags in a 'Duplicate' column

I want to assign 'Y' to the duplicate records and 'N' to the unique records, and display those 'Y' and 'N' flags in a 'Duplicate' column.
Like below:
Source Table:
Name,Location
Vivek,India
Vivek,UK
Vivek,India
Vivek,USA
Vivek,Japan
Target Table:
=============
Name,Location,Duplicate
Vivek,India,Y
Vivek,India,Y
Vivek,Japan,N
Vivek,UK,N
Vivek,USA,N
How do I create a mapping in Informatica PowerCenter?
Which logic should I use?
[See the Image for More Clarification][1]
[1]: https://i.stack.imgur.com/2F20A.png
You need to calculate a count grouped by the key columns using an Aggregator, and then join it back to the original flow on the key columns.
Use a Sorter to sort the data on the key columns (Name and Location in your example).
Use an Aggregator to calculate count() grouped by the key columns.
out_count = count(*)
in_out - key columns
Use a Joiner to join the Aggregator data and the Sorter data on the key columns. Drag out_count and the key columns from the Aggregator into the Joiner, and all columns from the Sorter. Do an inner join on the key columns.
Use an Expression and create an output port. Use the out_count column to calculate your duplicate flag.
out_Duplicate = IIF(out_count > 1, 'Y', 'N')
The whole mapping should look like this:
SRC --> SRT --> AGG --\
           \-----------> JNR --> EXP --> TGT
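For reference, the Sorter/Aggregator/Joiner/Expression flow above implements logic equivalent to this SQL (a sketch only; src is a placeholder name for your source):
select s.Name, s.Location,
       case when c.cnt > 1 then 'Y' else 'N' end as Duplicate
from src s
join (select Name, Location, count(*) as cnt
      from src
      group by Name, Location) c
  on s.Name = c.Name and s.Location = c.Location;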
There's one more way to solve it without the Joiner, which is a costly transformation. I'm going to use the Name and Location sample columns from your example.
Use a Sorter on Name and Location.
Add an Expression with a variable port for each key column, e.g. v_prev_Name and v_prev_Location.
Assign the expressions accordingly:
v_prev_Name = Name
v_prev_Location = Location
Next, create another variable port v_is_duplicate with the following expression:
IIF(v_prev_Name = Name and v_prev_Location = Location, 1, 0)
Move v_is_duplicate up the list of ports so that it is before v_prev_Name and v_prev_Location - THIS IS IMPORTANT. The order needs to be:
v_is_duplicate
v_prev_Name
v_prev_Location
Add an output port is_duplicate whose expression simply returns v_is_duplicate (map it to 'Y'/'N' if you need the flag as a letter).
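For intuition, the variable-port trick implements the same previous-row comparison as this SQL sketch (src is a placeholder name; note that, like the port logic above, it flags the second and later occurrences of a key rather than every member of a duplicate group):
select Name, Location,
       case when lag(Name)     over (order by Name, Location) = Name
             and lag(Location) over (order by Name, Location) = Location
            then 'Y' else 'N' end as is_duplicate
from src;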

Redshift Pivot Function

I've got a similar table which I'm trying to pivot in Redshift:
UUID   Key    Value
a123   Key1   Val1
b123   Key2   Val2
c123   Key3   Val3
Currently I'm using the following code to pivot it and it works fine. However, when I replace the IN part with a subquery it throws an error.
select *
from (select UUID, "Key", value from tbl) PIVOT (max(value) for "key" in (
    'Key1',
    'Key2',
    'Key3'
))
Question: What's the best way to replace the IN part with a subquery that takes distinct values from the Key column?
What I am trying to achieve:
select *
from (select UUID, "Key", value from tbl) PIVOT (max(value) for "key" in (
    select distinct "Key" from tbl
))
From the Redshift documentation - "The PIVOT IN list values cannot be column references or sub-queries. Each value must be type compatible with the FOR column reference." See: https://docs.aws.amazon.com/redshift/latest/dg/r_FROM_clause-pivot-unpivot-examples.html
So I think this will need to be done as a sequence of 2 queries. You likely can do this in a stored procedure if you need it as a single command.
Updated with the requested stored procedure (results to a cursor) example:
In order to make this supportable by you, I'll add some background info and a description of how this works. First off, a stored procedure cannot return results straight to your bench. It can either store the results in a (temp) table or in a named cursor. A cursor just stores the results of a query on the leader node, where they wait to be fetched. The lifespan of a cursor is the current transaction, so a COMMIT or ROLLBACK will delete the cursor.
Here's what you want to happen as individual SQL statements, but first let's set up the test data:
create table test (UUID varchar(16), Key varchar(16), Value varchar(16));
insert into test values
('a123', 'Key1', 'Val1'),
('b123', 'Key2', 'Val2'),
('c123', 'Key3', 'Val3');
The actions you want to perform are first to create a string for the PIVOT clause IN list like so:
select '\'' || listagg(distinct "key",'\',\'') || '\'' from test;
Then you want to take this string and insert it into your PIVOT query which should look like this:
select *
from (select UUID, "Key", value from test)
PIVOT (max(value) for "key" in ( 'Key1', 'Key2', 'Key3')
);
But doing this in the bench would mean taking the result of one query and copy/pasting it into a second query, and you want this to happen automatically. Unfortunately, Redshift does not allow sub-queries in the PIVOT statement, for the reason given above.
We can take the result of one query and use it to construct and run another query in a stored procedure. Here's such a stored procedure:
CREATE OR REPLACE procedure pivot_on_all_keys(curs1 INOUT refcursor)
AS
$$
DECLARE
row record;
BEGIN
select into row '\'' || listagg(distinct "key",'\',\'') || '\'' as keys from test;
OPEN curs1 for EXECUTE 'select *
from (select UUID, "Key", value from test)
PIVOT (max(value) for "key" in ( ' || row.keys || ' )
);';
END;
$$ LANGUAGE plpgsql;
What this procedure does is define and populate a "record" (1 row of data) called "row" with the result of the query that produces the IN list. Next it opens a cursor, whose name is provided by the calling command, with the contents of the PIVOT query which uses the IN list from the record "row". Done.
When executed (by running call) this function will produce a cursor on the leader node that contains the result of the PIVOT query. In this stored procedure the name of the cursor to create is passed to the function as a string.
call pivot_on_all_keys('mycursor');
All that needs to be done at this point is to "fetch" the data from the named cursor. This is done with the FETCH command.
fetch all from mycursor;
I prototyped this on a single-node Redshift cluster, and "FETCH ALL" is not supported in that configuration, so I had to use "FETCH 1000". So if you are also on a single-node cluster you will need to use:
fetch 1000 from mycursor;
The last point to note is that the cursor "mycursor" now exists, and if you try to rerun the stored procedure it will fail. You could pass a different name to the procedure (making another cursor), or you could end the transaction (END, COMMIT, or ROLLBACK), or you could close the cursor using CLOSE. Once the cursor is destroyed you can use the same name for a new cursor. If you wanted this to be repeatable you could run this batch of commands:
call pivot_on_all_keys('mycursor'); fetch all from mycursor; close mycursor;
Remember that the cursor has a lifespan of the current transaction, so any action that ends the transaction will destroy the cursor. If you have AUTOCOMMIT enabled in your bench, it will insert COMMITs that destroy the cursor (you can run the CALL and FETCH in a single batch to prevent this in many benches). Also, some commands (like TRUNCATE) perform an implicit COMMIT and will destroy the cursor too.
For these reasons, and depending on what else you need to do around the PIVOT query, you may want to have the stored procedure write to a temp table instead of a cursor. Then the temp table can be queried for the results. A temp table has a lifespan of the session so is a little stickier but is a little less efficient as a table needs to be created, the result of the PIVOT query needs to be written to the compute nodes, and then the results have to be sent to the leader node to produce the desired output. Just need to pick the right tool for the job.
===================================
To populate a table within a stored procedure you can just execute the commands. The whole thing will look like:
CREATE OR REPLACE procedure pivot_on_all_keys()
AS
$$
DECLARE
row record;
BEGIN
select into row '\'' || listagg(distinct "key",'\',\'') || '\'' as keys from test;
EXECUTE 'drop table if exists test_stage;';
EXECUTE 'create table test_stage AS select *
from (select UUID, "Key", value from test)
PIVOT (max(value) for "key" in ( ' || row.keys || ' )
);';
END;
$$ LANGUAGE plpgsql;
call pivot_on_all_keys();
select * from test_stage;
If you want this new table to have keys for optimizing downstream queries, you will want to create the table in one statement and then insert into it, but the above is the quick path.
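If the pivot columns are known up front, that keyed-table path might look like this (a sketch only; the distkey choice is illustrative, and the dynamic IN-list version would have to build this DDL inside the procedure the same way):
drop table if exists test_stage;
create table test_stage (
    UUID varchar(16),
    Key1 varchar(16),
    Key2 varchar(16),
    Key3 varchar(16)
) distkey(UUID);
insert into test_stage
select *
from (select UUID, "Key", value from test)
PIVOT (max(value) for "key" in ('Key1', 'Key2', 'Key3'));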
A little off-topic, but I wonder why Amazon couldn't introduce a simpler syntax for pivot. IMO, if GROUP BY is replaced by PIVOT BY, it can give enough hint to the interpreter to transform rows into columns. For example:
SELECT partname, avg(price) as avg_price FROM Part GROUP BY partname;
can be written as:
SELECT partname, avg(price) as avg_price FROM Part PIVOT BY partname;
Even multi-level pivoting can also be handled in the same syntax.
SELECT year, partname, avg(price) as avg_price FROM Part PIVOT BY year, partname;

What is the order of evaluation in Redshift?

I tried two types of join conditions in Redshift: first a WHERE clause after the JOIN ... ON, and second an extra AND condition inside the JOIN ... ON. I assumed that WHERE is executed after the join, so in that case it should have to scan many more rows.
explain
select *
from table1 t
left join table2 t2 on t.key = t2.key
where t.snapshot_day = to_date('2021-12-18', 'YYYY-MM-DD');
XN Hash Right Join DS_DIST_INNER  (cost=43055.58..114637511640937.91 rows=2906695 width=3169)
  Inner Dist Key: t.key
  Hash Cond: (("outer".asin)::text = ("inner".asin)::text)
  ->  XN Seq Scan on table2 t2  (cost=0.00..39874539.52 rows=3987453952 width=3038)
  ->  XN Hash  (cost=35879.65..35879.65 rows=2870373 width=131)
        ->  XN Seq Scan on table1 t  (cost=0.00..35879.65 rows=2870373 width=131)
              Filter: (snapshot_day = '2021-12-18 00:00:00'::timestamp without time zone)
On the other hand, in the query below the condition is applied inside the join, so I assumed fewer rows would need to be scanned for the join. But it returned far more rows and had a much higher estimated cost than the WHERE version:
explain
select *
from table1 t
left join table2 t2
  on t.key = t2.key
  and t.snapshot_day = to_date('2021-12-18', 'YYYY-MM-DD');
XN Hash Right Join DS_DIST_INNER  (cost=40860915.20..380935317239623.75 rows=3268873216 width=3169)
  Inner Dist Key: t.key
  Hash Cond: (("outer".key)::text = ("inner".key)::text)
  Join Filter: ("inner".snapshot_day = '2021-12-18 00:00:00'::timestamp without time zone)
  ->  XN Seq Scan on table2 t2  (cost=0.00..39874539.52 rows=3987453952 width=3038)
  ->  XN Hash  (cost=32688732.16..32688732.16 rows=3268873216 width=131)
        ->  XN Seq Scan on table1 t  (cost=0.00..32688732.16 rows=3268873216 width=131)
What is the difference between them? What am I misunderstanding in this case?
If someone has an opinion or materials, please let me know.
Thanks
There are several things happening here (in my original answer I missed that you were doing an outer join, so this is a complete rewrite).
WHERE happens before JOIN (i.e., real-world databases don't use relational algebra)
A join merges two result-sets on common column(s). The query optimizer will attempt to reduce the size of those result-sets by applying any predicates from the WHERE clause that are independent of the join.
Conditions in the JOIN clause control how the two result-sets are merged, nothing else.
This is a semantic difference between your two queries: when you specify the predicate on t.snapshot_day in the WHERE clause, it limits the rows selected from t. When you specify it in the JOIN clause, it controls whether a row from t2 matches a row in t, not which rows are selected from t or which rows are returned from the join.
You're doing an outer join.
In an inner join, rows between the two result-sets are joined if and only if all conditions in the JOIN clause are matched, and only those rows are returned. In practice, this will limit the result set to only those rows in t that fulfill the predicate.
An outer join, however, selects all rows from the outer result-set (in this case, all rows from t), and includes values from the inner result-set iff all join conditions are true. So you'll only include data from t2 where the key matches and the predicate applies. For all other rows in t you'll get nulls.
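A small hypothetical example to make the difference concrete: say t has rows (key=1, snapshot_day=2021-12-18) and (key=2, snapshot_day=2021-12-01), and t2 has rows with key=1 and key=2. With the predicate in the WHERE clause the query returns 1 row (key 1 joined to t2); the key=2 row is filtered out of t. With the predicate in the JOIN clause it returns 2 rows: key 1 joined to t2, and key 2 with NULLs for every t2 column, because the outer join keeps all rows of t and only uses the predicate to decide whether the t2 side matches. That is also why your second plan hashes all ~3.27 billion rows of t instead of the ~2.9 million filtered rows.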
That DS_DIST_INNER may be a problem
Not related to the join semantics, but for large queries in Redshift it's very expensive to redistribute rows. To avoid this, you should explicitly distribute all tables on the column that's used most frequently in a join (or used with the joins that involve the largest number of rows).
If you don't pick an explicit distribution key, then Redshift will allocate rows to nodes on a round-robin basis, and your query performance will suffer (because every join will require some amount of redistribution).
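For example, a sketch of what that might look like for the tables in the question (assuming key really is the most frequent join column; the column types and sizes here are made up):
create table table1 (key varchar(32), snapshot_day timestamp) distkey(key);
create table table2 (key varchar(32), other_col varchar(32)) distkey(key);
With both tables distributed on the join column, the join can run as DS_DIST_NONE instead of redistributing one side.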

Categorised data in Power BI

I am looking for a suggestion on how to categorise/group data in Power BI.
For example,
I have set up a conditional column in Power Query to achieve the results seen in the “Group” column, by saying if ID is 8304 then Group B, if ID is 8660 then Group F -- but the database is large and I am already facing a performance issue when trying to set up a report based on individual groups; it takes a long time to load the data.
Is there any alternative or better approach to group data?
ID     Group
8015   A
8020   A
8229   A
8304   B
8389   B
8391   C
8414   D
8421   A
8469   A
8572   A
8619   F
8660   F
8663   J
9102   A
9104   K
9120   A
I have set up a conditional column in Power Query to achieve the results seen in the “Group” column by saying if ID is 8304 then Group B, if ID is 8660 then Group F
Instead of a conditional column, use a helper table to store these links.
You can add the information to your main table by joining the two tables.

Informatica: comparing the date field between two tables

I am new to the Informatica software. I have two tables, say AAA and BBB.
AAA: last_post_date
BBB: Trx_No, Field1, Field2, trx_date
I want to move the BBB table to the target table, where trx_date must be greater than last_post_date. I cannot use a Joiner transformation as it doesn't support the >, <, >= and <= operators. If I want to use a Lookup transformation, how do I use it for this case, or is there any other way to do this? I searched many websites about the Lookup transformation but still don't know how to use it.
Please help.
Thanks!
I am assuming that AAA has only 1 row, which contains last_post_date. If both tables are in the same database, you can use a Source Qualifier override:
select b.Trx_No, b.Field1, b.Field2, b.trx_date from BBB b, AAA a where b.trx_date > a.last_post_date
But if the tables are in different databases and/or you are not able to create a DB link between them, then use the solution below.
After the Source Qualifier for each source, use an Expression transformation.
Add an output port in both Expression transformations, say o_Dummy, and hardcode the value to 1 (in both transformations).
Use a Joiner with a Normal Join. The join condition would be o_Dummy = o_Dummy1.
After it, use a Filter to filter records where trx_date > last_post_date.
This would be your flow:
SQ_AAA -> Expression -> Joiner -> Filter -> Target
SQ_BBB -> Expression -^
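For intuition, this dummy-key join plus filter is equivalent to the following SQL (a sketch only, using the column names from the question):
select b.Trx_No, b.Field1, b.Field2, b.trx_date
from BBB b
cross join AAA a  -- the o_Dummy = o_Dummy1 join is effectively a cross join against the single-row AAA
where b.trx_date > a.last_post_date;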
Use the Source Qualifier to read data from BBB, followed by a Lookup to AAA and a Filter with the condition trx_date > last_post_date.
Ideally you'd use an unconnected lookup referred to from an Expression variable port, e.g. v_LastPostDate = IIF(ISNULL(v_LastPostDate), LKP.LookupToAAA, v_LastPostDate) - this would ensure you perform the lookup only once. Not that it would matter a lot with a single value, but I thought I'd share some good practice :)