How to implement a row-level summary function in Informatica

Can someone please explain how to implement the following logic in Informatica, using transformations inside the mapping rather than a Source Qualifier query override?
SUM(WIN_30_DUR) OVER(PARTITION BY AGENT_MASTER_ID ORDER BY ROW_DT ROWS BETWEEN 30 PRECEDING AND 1 PRECEDING)
Basically this is an Oracle SQL requirement, but I want to implement it at the Informatica level.

Use an Aggregator to calculate sums grouped by AGENT_MASTER_ID, then self-join the result back to the pipeline on AGENT_MASTER_ID with a Joiner. Make sure the data is sorted and enable the Sorted Input property on the Aggregator; sorted input is also mandatory for a self-join.
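A rough sketch of that mapping, with hypothetical port names (note that a plain Aggregator SUM gives the total per AGENT_MASTER_ID; reproducing the exact ROWS BETWEEN 30 PRECEDING AND 1 PRECEDING window would need variable ports in an Expression transformation instead):

-- Sorter: sort by AGENT_MASTER_ID, ROW_DT
-- Aggregator (Sorted Input checked), Group By: AGENT_MASTER_ID
--   o_TOTAL_DUR (output port): SUM(WIN_30_DUR)
-- Joiner (Sorted Input checked), joining the Aggregator output back to the pipeline
--   Condition: AGENT_MASTER_ID = AGENT_MASTER_ID1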

Related

To filter specific records

I have a requirement to filter a flat file so that only those records whose colA value is 1, 2, 3, 4, 5, or 6 and whose colB is 'N' are kept. The records that satisfy this condition should be passed from the source file to the target.
Earlier I was told to check only one value of colA, so I applied
IIF(COLA='1' AND COLB='N', TRUE)
How do I filter on multiple values of the same column? I am new to Informatica PowerCenter.
There are two ways you can achieve this in an expression: the OR logical operator or the IN function.
With OR
IIF((COLA='1' OR COLA='2' OR COLA='3' OR COLA='4' OR COLA='5' OR COLA='6') AND COLB='N', TRUE)
The parentheses are essential to group the conditions on COLA.
With IN
IIF(IN(COLA,'1','2','3','4','5','6') AND COLB='N', TRUE)
I find this one easier to read.

How to get row count for large dataset in Informatica?

I am trying to get the row count for a dataset with 280 fields without affecting performance. Looking for the best possible ways to do this.
A good option to avoid performance issues is to use a Sorter transformation, sort on the group-by columns, and pass the pipeline to an Aggregator transformation with the Sorted Input option checked.
If your source is a database, index the columns used in conditions and also partition the table if required.
For your solution, I have 2 options in mind:
Using an Aggregator (remember to sort the data beforehand to improve performance in the next transformation): SQ > Aggregator > Target. Inside the Aggregator, add new ports with the SUM() and/or COUNT() functions, and remember to select the group-by columns.
Check out this example:
https://www.guru99.com/aggregator-transformation-informatica.html
Using a Source Qualifier query override: a traditional SELECT COUNT/SUM with GROUP BY in the database - SQ > Target.
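For the second option, a minimal sketch of the query override (the table and column names here are hypothetical):

SELECT AGENT_ID, COUNT(*) AS ROW_CNT
FROM MY_SOURCE_TABLE
GROUP BY AGENT_ID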
By the way, Informatica handles performance very well; rather than the number of columns, review how many records you are processing. A best practice is to push the heavy lifting to the data source/database rather than to the Informatica app.
Regards,
Juan
If all you need is to count the rows, use the Aggregator. That's what it's for. However, it will create a cache - to limit its size, pass it a single port.
To avoid caching, you can use a variable port in an Expression transformation and just increment it. This, however, will give you an extra column with all rows numbered rather than a single value, so you'll still need to aggregate it. Here you could use an Aggregator with no aggregate function, which returns just the last value.
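A minimal sketch of the variable-port approach (the port names are made up):

-- Expression transformation ports, evaluated top to bottom for each row:
-- v_COUNT (variable port): v_COUNT + 1
-- o_COUNT (output port):   v_COUNT
-- A downstream Aggregator with no group-by port and no aggregate function
-- then passes through only the last row, i.e. the final count.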

Pentaho PDI get SQL SUM() with conditions

I'm using Pentaho PDI 7.1. I'm trying to move data from MySQL to MySQL while changing the structure of the data.
I'm reading the source table (customers), and for each row I have to run another query to calculate the balance.
I was trying to use a Database Value Lookup step to accomplish this, but maybe it is not the best way.
I have to run a query like this to get the balance:
SELECT SUM(CASE WHEN direzione='ENTRATA' THEN -importo ELSE +importo END)
FROM Movimento
WHERE contoFidelizzato_id = ?
I need to set the parameter from the previous step. Any advice?
The Database Value Lookup may be a good idea, especially if you are used to database reasoning, but it results in one query per input row, which may not be the most efficient approach.
A more PDI-ish style would be to make the query like:
SELECT contoFidelizzato_id
, SUM(CASE WHEN direzione='ENTRATA' THEN -importo ELSE +importo END)
FROM Movimento
GROUP BY contoFidelizzato_id
and use it as the info source of a Stream Lookup step.
An even more PDI-ish style would be to split the source table (customers) into two flows: one in which you keep the source rows, and one that you group by contoFidelizzato_id. Of course, you then need a Formula step, a JavaScript step, or an expression in the SQL of the Table input step to flip the sign when needed; a sketch of the last option follows.
Test to find out which strategy is better in your case. You'll soon discover that PDI is very good at handling large data.
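For the sign flip inside the Table input step, a sketch using the table and columns from the question (the alias is made up):

SELECT contoFidelizzato_id,
       CASE WHEN direzione = 'ENTRATA' THEN -importo ELSE importo END AS importo_signed
FROM Movimento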

Aggregator transformation in Informatica

Is it compulsory to select a group-by port while performing a count operation in the Aggregator transformation in Informatica?
In order to perform a count, you have to specify at least one group-by column in the Aggregator transformation to let it know which column to group on.
Even if you don't provide a group by, the mapping will not fail, but you won't get the expected result.
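For example, a minimal Aggregator setup (the column and port names are hypothetical):

-- Aggregator ports:
-- DEPT_ID (input/output port, Group By checked)
-- o_CNT   (output port): COUNT(EMP_ID)
-- Returns one row per DEPT_ID with the number of EMP_ID values in that group.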
When using an Aggregator transformation, you need to check group by: the transformation aggregates the rows group by group and passes each result to the pipeline. If no group-by port is checked, it has nothing to aggregate on, so only a single row (the last row) is returned. In order to perform a count with respect to a specific column, it is mandatory to check group by for the required columns.
If you want to avoid grouping, you can use an Expression transformation with an incrementing variable port to produce a count for the required column without grouping.
Thank you
It's not mandatory to select at least one port as group by. However, if you don't choose any group-by port, Informatica will return only the last row.
Hope this helps

How to Scan HBase Rows efficiently

I need to write a MapReduce job that gets all rows in a given date range (say, the last month). It would have been a cakewalk had my row key started with the date, but my frequent HBase queries are on the starting values of the key.
My row key is exactly A|B|C|20120121|D, where the combination of A/B/C along with the date (in YearMonthDay format) makes a unique row ID.
My HBase tables could have up to a few million rows. Should my mapper read the whole table and filter each row by the date range, or can a Scan/Filter help handle this situation?
Could someone suggest a way (or a snippet of code) to handle this situation effectively?
Thanks
-Panks
A RowFilter with a regex comparator would work, but would not be the optimal solution. Alternatively, you can try to use secondary indexes.
One more option is the FuzzyRowFilter. A FuzzyRowFilter uses a kind of fast-forwarding, skipping many rows in the overall scan process, and will thus be faster than a plain RowFilter scan. You can read more about it here.
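A sketch of what that could look like, assuming the older HBase client API and fixed-width key segments - the 6-byte "A|B|C|" prefix and the 15-byte layout below are made-up assumptions, since FuzzyRowFilter needs fixed byte positions:

import java.util.Arrays;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FuzzyRowFilter;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.Pair;

// Fix the year-month bytes "201201" and let every other byte match anything.
// In the mask, 0 = byte must match the template, 1 = byte is fuzzy (ignored).
byte[] template = Bytes.toBytes("??????201201???"); // hypothetical 15-byte row key layout
byte[] mask = new byte[] {1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1};
Scan scan = new Scan();
scan.setFilter(new FuzzyRowFilter(Arrays.asList(new Pair<byte[], byte[]>(template, mask))));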
Alternatively, BloomFilters might also help, depending on your schema. If your data is huge, you should do a comparative analysis of secondary indexes and Bloom filters.
You can use a RowFilter with a RegexStringComparator. You'd need to come up with a RegEx that filters your dates appropriately. This page has an example that includes setting a Filter for a MapReduce scanner.
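A minimal sketch of that, assuming the pre-2.0 HBase client API and the key format from the question (the regex itself is illustrative):

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.RegexStringComparator;
import org.apache.hadoop.hbase.filter.RowFilter;

// Keep only rows whose date segment falls in January 2012.
Scan scan = new Scan();
scan.setFilter(new RowFilter(CompareFilter.CompareOp.EQUAL,
        new RegexStringComparator(".*\\|201201\\d{2}\\|.*")));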
I am just getting started with HBase, but Bloom filters might help.
You can modify the Scan that you send into the Mapper to include a filter. If your date is also the record timestamp, it's easy:
Scan scan = new Scan();
scan.setTimeRange(minTime, maxTime);
TableMapReduceUtil.initTableMapperJob("mytable", scan, MyTableMapper.class,
OutputKey.class, OutputValue.class, job);
If the date in your row key is different, you'll have to add a filter to your scan. This filter can operate on a column or a row key. I think it's going to be messy with just the row key. If you put the date in a column, you can make a FilterList where all conditions must be true and use a CompareOp.GREATER and a CompareOp.LESS. Then use scan.setFilter(filterList) to add your filters to the scan.
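A sketch of that FilterList approach, assuming the date is stored as a yyyyMMdd string in a column; the family and qualifier names ("cf", "row_dt") are made up:

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

// All conditions must pass: date > 20120101 AND date < 20120201.
FilterList filterList = new FilterList(FilterList.Operator.MUST_PASS_ALL);
filterList.addFilter(new SingleColumnValueFilter(Bytes.toBytes("cf"), Bytes.toBytes("row_dt"),
        CompareFilter.CompareOp.GREATER, Bytes.toBytes("20120101")));
filterList.addFilter(new SingleColumnValueFilter(Bytes.toBytes("cf"), Bytes.toBytes("row_dt"),
        CompareFilter.CompareOp.LESS, Bytes.toBytes("20120201")));
Scan scan = new Scan();
scan.setFilter(filterList);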