Joiner or lookup: which is best to perform a left join? - Informatica

My source1 has 100 records and source2 has 1M records, and I want to perform a left join. Which would you prefer: the Joiner transformation or the Lookup transformation?

I would prefer the Lookup. You have the following options, based on your scenario/data:
1. If source1 is in the same database as source2 - try left joining in the database itself, in the Source Qualifier. But you need to check the join condition and the relationship between the tables.
2. If source1 is not in the same database -
2.1 Use a Joiner - if you join on a unique key and you want many columns from the source.
2.2 Use a Lookup - if you do not have a unique key to join on and a many-to-many relation exists between the sources.
Performance-wise, option 1 will be better than option 2. Option 2.2 will be better than 2.1 because in 2.2 you only have to cache the small table, whereas the Joiner will cache the complete data.
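For option 1, a minimal sketch of what the Source Qualifier SQL override could look like (the table and column names src1, src2, id, col_a, col_b are hypothetical placeholders):
-- Push the left join down to the database instead of doing it in the mapping
SELECT s1.id, s1.col_a, s2.col_b
FROM src1 s1
LEFT OUTER JOIN src2 s2
  ON s1.id = s2.id
Because the join happens in the database, nothing has to be cached by the Integration Service, which is why option 1 usually performs best.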

Applying Left Join in Pentaho

I'm trying to create a transformation and need to merge two database tables based on queries like the ones below using a Merge Join, and I'm a little confused about what I should fill in for First Step and Second Step to Lookup for each query format.
Query Format :
SELECT * FROM A a LEFT JOIN B b on a.value=b.value
SELECT * FROM A a LEFT JOIN B b on b.value=a.value
There are various ways to do it.
Write the SQL with the join in a Table Input step. A quick and dirty solution if your tables are in the same database, but do not tell a PDI expert you did it that way.
If you know there is only one B record for each A record, use a Stream Lookup step. Very, very, very efficient. The main flow is A and the lookup step is B.
If you have many B records for each A record, use a Join Rows step. Don't be afraid: you do not really make a Cartesian product, as you can put your condition a.value=b.value on the step.
In the same situation, you can also use a Merge Join. The first step is the table you write first in the SQL SELECT statement.
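One thing to keep in mind with the Merge Join step: it expects both input streams to be sorted on the join keys, so either sort in the Table Input queries or add Sort rows steps. A minimal sketch of the two Table Input queries feeding a LEFT OUTER Merge Join on value:
-- First step (Table Input for A, the left side)
SELECT * FROM A ORDER BY value;
-- Second step (Table Input for B)
SELECT * FROM B ORDER BY value;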
Multiple ways to do this.
You can use a Table Input step and simply write your query. No need to do anything else to implement the above query.

Google Spanner - How do you copy data to another table?

Since Spanner does not support a DML statement like
insert into dest as (select * from source_table)
How do we select a subset of a table and copy those rows into another table?
I am trying to write data to a temporary table and then move the data to an archive table at the end of the day. The only solution I could find so far is to select rows from the source table and write them to the new table. This is done using the Java API, and it does not have a ResultSet-to-Mutation converter, so I need to map every column of the table to the new table, even when they are exactly the same.
Another issue is updating just one column's data; there is no way of doing "update table_name set column = column - 1".
Again, to do that I need to read the row and map every field to an update Mutation, but this is not practical if I have many tables, since I would need to code it for all of them. A ResultSet -> Mutation converter would be nice here too.
Is there any generic Mutation cloner and/or any other way to copy data between tables?
As of version 0.15, this open source JDBC Driver supports bulk INSERT statements that can be used to copy data from one table to another. The INSERT syntax can also be used to perform bulk UPDATEs on data.
Bulk insert example:
INSERT INTO TABLE
(COL1, COL2, COL3)
SELECT C1, C2, C3
FROM OTHER_TABLE
WHERE C1>1000
Bulk update is done using an INSERT statement with the addition of ON DUPLICATE KEY UPDATE. You have to include the value of the primary key in your insert statement in order to 'force' a key violation, which in turn will ensure that the existing rows are updated:
INSERT INTO TABLE
(COL1, COL2, COL3)
SELECT COL1, COL2+1, COL3+COL2
FROM TABLE
WHERE COL2<1000
ON DUPLICATE KEY UPDATE
You can use the JDBC driver with for example SQuirreL to test it, or to do ad-hoc data manipulation.
Please note that the underlying limitations of Cloud Spanner still apply, meaning a maximum of 20,000 mutations in one transaction. The JDBC Driver can work around this limit by specifying the value AllowExtendedMode=true in your connection string or in the connection properties. When this mode is allowed, and you issue a bulk INSERT- or UPDATE-statement that will exceed the limits of one transaction, the driver will automatically open an extra connection and perform the bulk operation in batches on the new connection. This means that the bulk operation will NOT be performed atomically, and will be committed automatically after each successful batch, but at least it will be done automatically for you.
Have a look here for some more examples: http://www.googlecloudspanner.com/2018/02/data-manipulation-language-with-google.html
Another approach to perform the bulk copy is to batch the inserts using LIMIT & OFFSET (add an ORDER BY on a key column so the batches are stable and do not overlap or skip rows):
insert into dest(c1,c2,c3)
(select c1,c2,c3 from source_table ORDER BY c1 LIMIT 1000);
insert into dest(c1,c2,c3)
(select c1,c2,c3 from source_table ORDER BY c1 LIMIT 1000 OFFSET 1000);
insert into dest(c1,c2,c3)
(select c1,c2,c3 from source_table ORDER BY c1 LIMIT 1000 OFFSET 2000);
.
.
.
Repeat until you have covered all the required rows.
PS: This is more of a trick, but it will definitely save you time.
Spanner supports expressions in the SET section of an UPDATE statement, which can be used to supply a subquery fetching data from another table, like this:
UPDATE target_table
SET target_field = (
-- use subquery as an expression (must return a single row)
SELECT source_table.source_field
FROM source_table
WHERE my_condition IS TRUE
) WHERE my_other_condition IS TRUE;
The generic syntax is:
UPDATE table SET column_name = { expression | DEFAULT } WHERE condition
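As a sketch only, the subquery can also be correlated with the row being updated; the id column used below is a hypothetical key shared by both tables:
UPDATE target_table
SET target_field = (
  -- correlated subquery: look up the matching row via the (hypothetical) shared key
  SELECT s.source_field
  FROM source_table s
  WHERE s.id = target_table.id
)
WHERE my_other_condition IS TRUE;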

Redshift: Aggregating data on a large number of dimensions is slow

I have an Amazon Redshift table with about 400M records and 100 columns - 80 dimensions and 20 metrics.
The table is distributed by one of the high-cardinality dimension columns and includes a couple of high-cardinality columns in the sort key.
A simple aggregate query:
Select dim1, dim2...dim60, sum(met1),...sum(met15)
From my table
Group by dim1...dim60
is taking too long. The explain plan looks simple: just a sequential scan and a HashAggregate on the table. Any recommendations on how I can optimize it?
1) If your table is heavily denormalized (your 80 dimensions are in fact 20 dimensions with 4 attributes each), it is faster to group by the dimension keys only, and if you really need all the dimension attributes, join the aggregated result back to the dimension tables to get them, like this:
with
groups as (
select dim1_id,dim2_id,...,dim20_id,sum(met1),sum(met2)
from my_table
group by 1,2,...,20
)
select *
from groups
join dim1_table
using (dim1_id)
join dim2_table
using (dim2_id)
...
join dim20_table
using (dim20_id)
If you don't want to normalize your table and you like having a single row with all the information, it's fine to keep it as is, since in a columnar database unused columns won't slow the queries down. But grouping by 80 columns is definitely inefficient and has to be "pseudo-normalized" in the query.
2) If your dimensions are hierarchical, you can group by the lowest level only and then join the higher-level dimension attributes. For example, if you have country, country region and city with 4 attributes each, there's no need to group by 12 attributes; all you need to do is group by the city ID and then join the city's attributes, country region and country tables to the city ID of each group.
3) You can store the combination of dimension IDs, joined with some delimiter like -, in a separate varchar column and use that as the sort key.
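A minimal sketch of option 3, assuming integer dimension IDs cast to varchar for the concatenated key (the table and column names are placeholders, and only three of the dimension IDs are shown):
CREATE TABLE my_table_sorted
DISTKEY (dim1_id)
SORTKEY (dim_combo)
AS
SELECT CAST(dim1_id AS varchar) || '-' || CAST(dim2_id AS varchar) || '-' || CAST(dim3_id AS varchar) AS dim_combo,
       t.*
FROM my_table t;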
Sequential scans are quite normal for Amazon Redshift. Instead of using indexes (which themselves would be Big Data), Redshift uses parallel clusters, compression and columnar storage to provide fast queries.
Normally, optimization is done via:
DISTKEY: Typically used on the most-JOINed column (or most GROUPed column) to localize joined data on the same node.
SORTKEY: Typically used for fields that most commonly appear in WHERE statements to quickly skip over storage blocks that do not contain relevant data.
Compression: Redshift automatically compresses data, but over time the skew of data could change, making another compression type more optimal.
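For the compression point above, a quick hedged check (assuming the table is named my_table) is to ask Redshift for encoding recommendations on the existing data:
-- Reports suggested column encodings and estimated space savings
ANALYZE COMPRESSION my_table;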
Your query is quite unusual in that you are using GROUP BY on 60 columns across all rows in the table. This is not a typical Data Warehousing query (where rows are normally limited by WHERE and tables are connected by JOIN).
I would recommend experimenting with fewer GROUP BY columns and breaking the query down into several smaller queries via a WHERE clause to determine what is occupying most of the time. Worst case, you could run the results nightly and store them in a table for later querying.
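For the last suggestion, a hedged sketch of materializing the aggregate nightly into its own table (the name daily_agg is a placeholder; the elided dimension and metric lists are kept as comments):
-- Nightly job: rebuild a pre-aggregated table for later querying
DROP TABLE IF EXISTS daily_agg;
CREATE TABLE daily_agg AS
SELECT dim1, dim2, /* ...dim60, */ SUM(met1) AS met1 /* , ...SUM(met15) */
FROM my_table
GROUP BY dim1, dim2 /* , ...dim60 */;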

Azure SQL DW CTAS of over 102,400 rows to one distribution doesn't automatically compress

I thought the way columnstores worked was that if you bulk load over 102,400 rows into one distribution of a columnstore, it would automatically compress it. I'm not observing that in Azure SQL DW.
I'm doing the following CTAS statement:
create table ColumnstoreDemoCTAS
WITH (CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION=HASH(Column1))
AS
select top 102401 cast(1 as int) as Column1, f.*
from FactInternetSales f
cross join sys.objects o1
cross join sys.objects o2
Now I check the status of the columnstore row groups:
select t.name
,NI.distribution_id
,CSRowGroups.state_description
,CSRowGroups.total_rows
,CSRowGroups.deleted_rows
FROM sys.tables AS t
JOIN sys.indexes AS i
ON t.object_id = i.object_id
JOIN sys.pdw_index_mappings AS IndexMap
ON i.object_id = IndexMap.object_id
AND i.index_id = IndexMap.index_id
JOIN sys.pdw_nodes_indexes AS NI
ON IndexMap.physical_name = NI.name
AND IndexMap.index_id = NI.index_id
LEFT JOIN sys.pdw_nodes_column_store_row_groups AS CSRowGroups
ON CSRowGroups.object_id = NI.object_id
AND CSRowGroups.pdw_node_id = NI.pdw_node_id
AND CSRowGroups.distribution_id = NI.distribution_id
AND CSRowGroups.index_id = NI.index_id
WHERE t.name = 'ColumnstoreDemoCTAS'
ORDER BY 1,2,3,4 desc;
I end up with one OPEN rowgroup with 102401 rows. Did I misunderstand this behavior of columnstores? Is Azure SQL DW different?
I see the same behavior if I do a bulk insert from SSIS of the same number of rows, all as one buffer.
I tried Drew's suggestion of inserting over 6.5 million rows and I still end up with all OPEN row groups:
create table ColumnstoreDemoWide
WITH (CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION=HASH(Column1))
AS
select top 7000000 ROW_NUMBER() OVER (ORDER BY f.ProductKey) as Column1, f.*
from FactInternetSales f
cross join sys.objects o
cross join sys.objects o2
cross join sys.objects o3
Placing your data in a clustered columnstore will not decrease the number of rows returned. Instead, it will compress the data stored so that it takes up less space on disk. This means that less data is moved for queries and you are charged less for storage, but your results will stay the same. That being said, your data is currently residing in a deltastore, so you will not see any compression. Due to SQL DW's architecture we separate the data into a number of groups under the covers. This allows us to more easily parallelize computations and scale, but it also means that each group has its own columnstore/deltastore, so you will need to load more rows to get the compression benefits.
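If you want the rows compressed without loading more data, one hedged option (using the example table name from the question) is to rebuild the clustered columnstore index, which closes open delta rowgroups and compresses them:
-- Force open/delta rowgroups to be compressed into columnstore segments
ALTER INDEX ALL ON ColumnstoreDemoCTAS REBUILD;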
In addition to the distribution structure, there is a difference in thresholds between SQL Server and SQL Data Warehouse. For SQL DW the threshold was 1,048,576 until a defect was resolved, as #JRJ describes. Now Azure SQL DW's threshold is 102,400, like the rest of the SQL family. Once the number of rows in a distribution exceeds this, you should see that your rows are compressed.
You can find a bit more information on loading into a columnstore here: https://msdn.microsoft.com/en-US/library/dn935008.aspx
This was a defect in the service. The fix is currently being rolled out. If you try this out on Japan West for example you will see that the behaviour is as you would expect.

Difference between Joiner and Union Transformation

I'm new to Informatica. What is the difference between the Joiner and Union transformations? Also, should we use a Router instead of a Joiner to increase performance when there are multiple sources?
Joiner
Using a Joiner we can remove duplicate rows.
The join can be Normal, Right Outer, Left Outer, or Full Outer.
In a Joiner we have one input group and one output group.
A join is implemented by using the Joiner transformation in Informatica.
The Joiner transformation combines data records horizontally based on a join condition.
Union
Union will not remove duplicate rows.
Union is equivalent to UNION ALL in SQL.
In a Union we have multiple input groups and one output group.
A union is implemented by using the Union transformation in Informatica.
The Union transformation combines data records vertically from multiple sources.
Union also supports heterogeneous (different) sources.
Now, the Router transformation is an active and connected transformation. It is similar to the Filter transformation, which is used to test a condition and filter data. In a Filter transformation you can specify only one condition, and it drops the rows that do not satisfy that condition. In a Router transformation you can specify more than one condition, and it provides the ability to route the data that meets each condition. Use a Router transformation if you need to test the same input data against multiple conditions.
So, when the data is coming from multiple sources you can use a Router to route values accordingly. It will increase your performance and save time too.
Joiner
1. For two sources to be joined there must be at least one common column between them, with the same data type, on which they can be joined.
2. Performs horizontal merging of sources.
3. Types are: a. Normal, b. Left Outer, c. Right Outer, d. Full Outer.
4. Can join at most two sources at a time.
5. Avoids duplicates if the join condition is correct.
Union
1. In a Union, the corresponding columns of the two sources must have compatible data types, and the number of columns in source1 must equal the number of columns in source2.
2. Performs vertical merging of sources.
3. Does not have any types.
4. Can combine any number of sources at a time.
5. Since it is equivalent to UNION ALL in SQL, it can produce duplicates.
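As a rough SQL analogy of the two transformations (the tables emp_us, emp_uk and dept and their columns are hypothetical), the Joiner corresponds to a horizontal JOIN on a condition, while the Union corresponds to a vertical UNION ALL:
-- Joiner: horizontal merge based on a join condition
SELECT e.emp_id, e.emp_name, d.dept_name
FROM emp_us e
LEFT OUTER JOIN dept d ON e.dept_id = d.dept_id;
-- Union: vertical merge of same-structured sources, keeps duplicates (UNION ALL)
SELECT emp_id, emp_name FROM emp_us
UNION ALL
SELECT emp_id, emp_name FROM emp_uk;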