I'm new to Informatica. What is the difference between the Joiner and Union transformations? Also, should we use a Router instead of a Joiner to increase performance when there are multiple sources?
Joiner
Using a Joiner we can remove duplicate rows
A Joiner can be a Normal, Right Outer, Left Outer or Full Outer join
A Joiner has two input groups (Master and Detail) and one output group
A join is implemented using the Joiner Transformation in Informatica
The Joiner Transformation combines data records horizontally based on a join condition
Union
A Union will not remove duplicate rows
Union is equivalent to UNION ALL in SQL
A Union has multiple input groups and one output group
A union is implemented using the Union Transformation in Informatica
The Union Transformation combines data records vertically from multiple sources
The Union also supports heterogeneous (different) sources
Now, the Router transformation is an active and connected transformation. It is similar to the Filter transformation, which is used to test a condition and filter the data. In a Filter transformation you can specify only one condition, and rows that do not satisfy the condition are dropped. In a Router transformation, by contrast, you can specify more than one condition, and it gives you the ability to route the data that meets each condition to its own output group. Use a Router transformation if you need to test the same input data against multiple conditions.
So, when the data is coming from multiple sources you can use a Router to route values accordingly. It will increase your performance and save time too.
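If it helps to think of it in SQL terms: a Filter is roughly a single WHERE clause, while a Router is closer to evaluating several conditions over the same rows in one pass and sending each match to its own target. A rough sketch only, with hypothetical table and group names:

-- One pass over the same input, routing rows to different groups.
-- A Filter would keep only one of these conditions; a Router keeps them all,
-- each feeding its own downstream target. Names are hypothetical.
SELECT order_id,
       amount,
       CASE
           WHEN amount >= 1000 THEN 'HIGH_VALUE'
           WHEN amount >= 100  THEN 'MEDIUM_VALUE'
           ELSE 'DEFAULT'
       END AS route_group
FROM orders;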
Joiner
1. For two sources to be joined there must be at least one common column between them, with the same data type, on which the join condition can be based.
2. Performs horizontal merging of sources.
3. Types are:
   a. Normal
   b. Left Outer
   c. Right Outer
   d. Full Outer
4. Can join at most two sources at a time.
5. Avoids duplicates if the join condition is correct.
Union
1. In a Union, all corresponding columns of the two sources must have similar data types, and the number of columns in source1 must equal the number of columns in source2.
2. Performs vertical merging of sources.
3. Does not have any types.
4. Can combine any number of sources at a time.
5. As it is equivalent to UNION ALL in SQL, it can also produce duplicates.
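Since the comparison above maps Union to UNION ALL, it may help to see the two ideas side by side in plain SQL. This is only an analogy with hypothetical tables and columns; in Informatica you would configure the equivalent Joiner/Union transformations rather than write SQL:

-- Joiner: horizontal merge on a common column (here a Normal/inner join)
SELECT a.cust_id, a.name, b.order_total
FROM customers_us a
JOIN orders b
  ON a.cust_id = b.cust_id;

-- Union: vertical merge of two sources with matching column lists
-- (UNION ALL keeps duplicates, just like the Union Transformation)
SELECT cust_id, name, city FROM customers_us
UNION ALL
SELECT cust_id, name, city FROM customers_eu;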
So I have two source tables, let's call them table1 and table2, and a destination table table3. Inside these tables there is information that needs to be extracted from columns of one table and columns of the other table, and then combined to produce entries for columns of the new table.
Think of it as a complex transformation; for example:
partial text in column1 extracted from table1 and complete text in column1 of table2, combined into 4 rows of column1 (depending on the JSON of column1 in table1) in the new transformed table.
So it's not a 1-to-1 mapping between one table and another, but a 1-to-many mapping, where one row of output draws on a mix of one row from each of the two source tables and translates into many rows of the new destination table.
Is this something that Glue jobs can accomplish, or am I better off just writing a throwaway Python script? You can assume that the size of the tables is not a concern.
Provided you plan to run this process at some frequency, this is a perfect use case for Glue. If this is just a one-off, Glue is also a fine choice, but Glue is primarily designed for repeated use.
In your Glue script I expect you will end up joining the two tables, and then selecting new result columns and rows by combining your existing columns. Typically the pattern to follow is to convert the dynamic frames (created by Glue) into PySpark data frames, work with PySpark from there, and convert back to a dynamic frame before outputting to the database.
Note that, depending on your design, you may not need to add rows; it of course depends on the outcome you are seeking, but Dynamo does have support for some nifty hierarchical approaches that may remove your need for multiple rows.
If you have more specific examples of schema and the outcomes you are seeking, I could show you a bit of example code.
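In the meantime, here is a rough sketch of the join-and-expand step. The real Glue script would wrap this in PySpark (dynamic frame to data frame and back), but the core logic can be expressed as Spark SQL once the two tables are registered as temporary views. All table, column and key names here are assumptions, as is the idea that column1 in table1 holds a JSON array that fans out into multiple rows:

-- Spark SQL sketch: join table1 and table2 on an assumed shared key,
-- then explode a JSON array in table1.column1 so one source row
-- becomes several destination rows. Names and schema are hypothetical.
SELECT t2.column1       AS full_text,
       parts.part_value AS extracted_part
FROM table1 t1
JOIN table2 t2
  ON t1.join_key = t2.join_key
LATERAL VIEW explode(from_json(t1.column1, 'array<string>')) parts AS part_value;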
I have two sets of identical data that I filter differently. One shows sales by location in test locations and the other in control locations. Is there a way to append the results in a table with a "Test/Control" flag based on the first set of slicers so that I can show all the locations color coded by the flag?
You have two options to achieve this. In the Model (DAX) you can create a calculated Table, and use the UNION function to append the two sets of rows together.
https://dax.guide/union/
However, UNION is quite fussy - the two parameter tables must have the same set of columns. Sometimes you can overcome small differences by adding other functions, but complex transforms are harder and you can't debug them.
For complex requirements, you can use the Power Query Editor - it has an Append Query button on the Home Ribbon. Each query you feed in can have complex transformations.
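Whichever route you take, the key step is the same: add a literal Test/Control column to each filtered set before appending, so the flag survives the union and can drive your colour coding. In relational terms (DAX's UNION with ADDCOLUMNS, or Power Query's Append, follow the same shape; table and column names here are hypothetical):

-- Append the two filtered sets, tagging each with its origin
SELECT location, sales_amount, 'Test'    AS test_control_flag FROM sales_test_locations
UNION ALL
SELECT location, sales_amount, 'Control' AS test_control_flag FROM sales_control_locations;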
I have an Amazon redshift table with about 400M records and 100 columns - 80 dimensions and 20 metrics.
Table is distributed by 1 of the high cardinality dimension columns and includes a couple of high cardinality columns in sort key.
A simple aggregate query:
Select dim1, dim2, ..., dim60, sum(met1), ..., sum(met15)
From my_table
Group by dim1, ..., dim60
is taking too long. The explain plan looks simple: just a sequential scan and a HashAggregate on the table. Any recommendations on how I can optimize it?
1) If your table is heavily denormalized (your 80 dimensions are in fact 20 dimensions with 4 attributes each) it is faster to group by the dimension keys only, and if you really need all the dimension attributes, join the aggregated result back to the dimension tables to get them, like this:
with
groups as (
    -- aggregate by the dimension keys only
    select dim1_id, dim2_id, ..., dim20_id, sum(met1), sum(met2)
    from my_table
    group by 1, 2, ..., 20
)
-- then join the aggregated result back to the dimension tables
select *
from groups
join dim1_table using (dim1_id)
join dim2_table using (dim2_id)
...
join dim20_table using (dim20_id)
If you don't want to normalize your table and you like having all the pieces of information in a single row, it's fine to keep it as is, since in a columnar database unused columns won't slow your queries down. But grouping by 80 columns is definitely inefficient and has to be "pseudo-normalized" in the query.
2) If your dimensions are hierarchical you can group by the lowest level only and then join the higher-level dimension attributes. For example, if you have country, country region and city with 4 attributes each, there's no need to group by all 12 attributes; all you need to do is group by the city ID and then join the city's attributes, country region and country tables to the city ID of each group (a sketch follows this list).
3) You can store the combination of dimension IDs, with some delimiter such as "-", in a separate varchar column and use that as a sort key.
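Here is a sketch of the hierarchical approach from point 2. The city/region/country table and column names are purely hypothetical:

-- Group by the lowest-level key (city) only, then attach the
-- higher-level attributes afterwards. Names are hypothetical.
with groups as (
    select city_id, sum(met1) as met1, sum(met2) as met2
    from my_table
    group by city_id
)
select co.country_name, r.region_name, c.city_name, g.met1, g.met2
from groups g
join city_table c     on c.city_id = g.city_id
join region_table r   on r.region_id = c.region_id
join country_table co on co.country_id = r.country_id;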
Sequential scans are quite normal for Amazon Redshift. Instead of using indexes (which themselves would be Big Data), Redshift uses parallel clusters, compression and columnar storage to provide fast queries.
Normally, optimization is done via:
DISTKEY: Typically used on the most-JOINed column (or most GROUPed column) to localize joined data on the same node.
SORTKEY: Typically used for fields that most commonly appear in WHERE statements to quickly skip over storage blocks that do not contain relevant data.
Compression: Redshift automatically compresses data, but over time the skew of data could change, making another compression type more optimal.
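For example, declaring a distribution key and sort key at table creation might look like the following; the column choices here are purely illustrative and should be driven by your own join, GROUP BY and WHERE patterns:

-- Illustrative only: pick DISTKEY/SORTKEY columns from your own workload.
CREATE TABLE my_table_optimized (
    dim1 VARCHAR(50),
    dim2 VARCHAR(50),
    met1 NUMERIC(18,2),
    met2 NUMERIC(18,2)
)
DISTSTYLE KEY
DISTKEY (dim1)
SORTKEY (dim2, dim1);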
Your query is quite unusual in that you are using GROUP BY on 60 columns across all rows in the table. This is not a typical Data Warehousing query (where rows are normally limited by WHERE and tables are connected by JOIN).
I would recommend experimenting with fewer GROUP BY columns and breaking the query down into several smaller queries via a WHERE clause to determine what is occupying most of the time. Worst case, you could run the results nightly and store them in a table for later querying.
I am trying to solve an Informatica problem
I have two tables, Table A and Table B, with the following structure:
Table A
A_Key
A_Name
A_Address
A_PostalCode
A_Country
A_Latitude
A_Longitude
Table B
B_Key
B_Name
B_PostalCode
B_Latitude
B_Longitude
I need to combine A & B in order to have one output table that contains all the attributes of A & B.
Since I am new to the Informatica Data Quality tool, I am trying to work out the logic for how I can implement this.
Does anyone have a better solution?
You can use a Joiner Transformation to do this.
It has two groups - Master and Detail. Ideally, you should connect the table with less data as the Master, and the table with the additional data as the Detail.
Ensure your table data is sorted before connecting it to the Joiner. Also, enable Sorted Input in the Advanced section of the Joiner Transformation.
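In SQL terms, the Joiner approach amounts to something like the sketch below. The join condition is an assumption (there is no stated common key between A and B), and the table names are adapted to remove spaces; use whatever column or columns actually relate the two tables:

-- Rough SQL equivalent of the Joiner approach. The join condition is
-- an assumption; adjust the join type and keys to your real data.
SELECT a.A_Key, a.A_Name, a.A_Address, a.A_PostalCode, a.A_Country,
       a.A_Latitude, a.A_Longitude,
       b.B_Key, b.B_Name, b.B_Latitude, b.B_Longitude
FROM Table_A a
JOIN Table_B b
  ON a.A_PostalCode = b.B_PostalCode;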
Again, for PowerCenter, this scenario sounds more like a Union to me, with the columns missing from group B set to null.
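In SQL terms that union would align the column lists and pad the columns that only exist in Table A with NULLs for the Table B rows, roughly like this (table names adapted to remove spaces):

-- Union approach: matching column lists, missing B columns filled with NULL
SELECT A_Key AS key, A_Name AS name, A_Address AS address,
       A_PostalCode AS postal_code, A_Country AS country,
       A_Latitude AS latitude, A_Longitude AS longitude
FROM Table_A
UNION ALL
SELECT B_Key, B_Name, NULL, B_PostalCode, NULL, B_Latitude, B_Longitude
FROM Table_B;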
I'm attempting to create a shared date dimension between two fact tables in Power BI, based on a relational data source.
Currently, if I include an unrelated dimension in the report, I get numbers duplicated across multiple rows, where they don't really apply.
I'm wondering if there is any way to tell Power BI that certain dimensions cannot be used with certain fact tables, similar to using IgnoreUnrelatedDimensions in SSAS.
Currently the only solution I can find is to create a separate date dimension, so that the two fact tables have no relationship that could be used to join them, however this would mean forfeiting the ability to do any time based comparisons.
Create a combined view of the fact tables with only the compatible columns, to be used for time-based comparison:
In the Query Editor, create new queries for your fact tables by referencing them,
i.e. right-click the original query and select "Reference".
Then in those "copies", cut out the incompatible dimensions.
Rename columns to align terminology (e.g. Sales Date ==> Transaction Date, Payment Date ==> Transaction Date).
Use "Merge Queries" function to combine the copies using Full Outer Join.
Join this merged view to your date dimension.
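For reference, the combined view those steps build is roughly equivalent to the following relational shape, assuming (purely hypothetically) that the two fact tables are sales and payments and that the renamed common column is the transaction date:

-- Combined fact view: only the compatible columns, aligned on the renamed
-- Transaction Date and merged with a full outer join. Names are hypothetical.
SELECT COALESCE(s.transaction_date, p.transaction_date) AS transaction_date,
       s.sales_amount,
       p.payment_amount
FROM sales s
FULL OUTER JOIN payments p
  ON s.transaction_date = p.transaction_date;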