To filter specific records - Informatica

I have a requirement to filter (from a flat file) only those records that have a ColA value of 1, 2, 3, 4, 5, or 6 and also ColB = 'N'. The records that satisfy this condition in the source file should be passed to the target.
Earlier the requirement was to check for only one value of ColA, so I used
IIF(COLA='1' AND COLB='N', TRUE)
How do I filter on multiple values for the same column? I am new to Informatica PowerCenter.

There are two ways you can achieve this in an expression: using the OR logical operator or using the IN function.
With OR
IIF((COLA='1' OR COLA='2' OR COLA='3' OR COLA='4' OR COLA='5' OR COLA='6') AND COLB='N', TRUE)
The parentheses are essential to group the conditions on COLA, because AND binds more tightly than OR.
With IN
IIF(IN(COLA, '1', '2', '3', '4', '5', '6') AND COLB='N', TRUE)
I find this one easier to read.
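Note that the Filter transformation's filter condition is itself a Boolean expression, so the IIF wrapper is optional; the condition can be written directly as
IN(COLA, '1', '2', '3', '4', '5', '6') AND COLB='N'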

How can I partition an Arrow Table by value in one pass?

I would like to be able to partition an Arrow table by the values of one of its columns (assuming the set of n values occurring in that column is known). The straightforward way is a for-loop: for each of these values, scan the whole table and build a new table of matching rows. Are there ways to do this in one pass instead of n passes?
I initially thought that Arrow's support for group-by scans would be the solution -- but Arrow (in contrast to Pandas) does not support extracting groups after a group-by scan.
Am I just thinking about this wrong and there is another way to partition a table in one pass?
For the group-by support, there is a "hash_list" aggregate function that returns all values in each group. Is that what you're looking for? You could then slice the resulting lists after the fact to extract the individual groups.
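A minimal sketch of that idea in Python with pyarrow (the column names and data here are made up): tag each row with its position, collect the positions per key in a single group-by pass using the "list" aggregation (backed by hash_list), then take() each partition.

import pyarrow as pa

table = pa.table({
    "key": ["a", "b", "a", "c", "b"],
    "value": [1, 2, 3, 4, 5],
})

# One pass over the table: collect the row positions of each group.
indexed = table.append_column("idx", pa.array(range(table.num_rows)))
grouped = indexed.group_by("key").aggregate([("idx", "list")])

# Materialise each partition from the collected row indices.
partitions = {
    grouped["key"][i].as_py(): table.take(grouped["idx_list"][i].values)
    for i in range(grouped.num_rows)
}

The take() calls at the end gather each partition's rows, so the expensive scan work happens once in the group-by rather than once per distinct value.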

CALCULATE - how does AND logic work when multiple FILTERS are used?

When there are multiple filters, they're evaluated by using the AND logical operator. That means all conditions must be TRUE at the same time.
I understand this when the filters are like:
AMOUNT>100, CATEGORY='Sales'
However, when one or more of the filters is given by a FILTER expression, I am unable to visualise how the AND logic works (and what "all conditions must be TRUE" means, given that a FILTER condition is itself a table). Could you please give an example?
All of the filter arguments apply to the expanded table: each one evaluates to a table of allowed rows or values for specific columns, and CALCULATE intersects them, so a row contributes only if every filter table permits it. That intersection is what "all conditions must be TRUE" means.
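As an illustration (the table and column names here are made up):

CALCULATE (
    SUM ( Sales[Amount] ),
    FILTER ( Sales, Sales[Amount] > 100 ),
    Sales[Category] = "Sales"
)

The predicate Sales[Category] = "Sales" is just shorthand for FILTER ( ALL ( Sales[Category] ), Sales[Category] = "Sales" ), so both arguments are tables; a Sales row contributes to the sum only if it is allowed by both of them, which is the AND.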

How to get row count for large dataset in Informatica?

I am trying to get the row count for a dataset with 280 fields without affecting performance. Looking for the best possible ways to do this.
A good option to avoid performance issues is to use a Sorter transformation to sort the data, then pass the pipeline to an Aggregator transformation with the Sorted Input option checked, so the Aggregator does not have to cache all rows.
If your source is a database, also index the columns used in filter conditions and partition the table if required.
For your problem, I have two options in mind:
Using an Aggregator (remember to sort the data beforehand to improve performance in the next transformation): SQ > Aggregator > Target. Inside the Aggregator, add new ports with the SUM() and/or COUNT() functions, and remember to select the group-by columns if you need per-group results.
Check out this example:
https://www.guru99.com/aggregator-transformation-informatica.html
Using a Source Qualifier query override: a traditional SELECT COUNT/SUM with GROUP BY pushed to the database, then SQ > Target. Sketches of both options follow below.
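Hedged sketches of both options (the port, table, and column names are illustrative):
For option 1, a single output port in the Aggregator with no group-by port returns one row carrying the total:
O_ROW_COUNT = COUNT(*)
For option 2, an override that lets the database do the counting, so nothing is cached in the mapping:
SELECT COUNT(*) AS ROW_CNT FROM MY_SOURCE_TABLE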
By the way, Informatica handles performance very well; rather than the number of columns, review how many records you are processing. A best practice is to push the heavy work to the data source/database rather than the Informatica app.
Regards,
Juan
If all you need is to count the rows, use the Aggregator - that's what it's for. However, this will create a cache; to limit its size, pass only a single port through it.
To avoid caching, you can use a variable port in an Expression transformation and just increment it. However, this gives you an extra column with all rows numbered, not a single value, so you'll still need to aggregate it. Here an Aggregator with no group-by port and no aggregate function would work, since it returns just the last row, which carries the final count.
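A minimal sketch of that variable-port approach (the port names are illustrative):
v_COUNT (variable port) = v_COUNT + 1
o_ROW_COUNT (output port) = v_COUNT
Variable ports retain their value across rows, so the last row leaving the Expression carries the total row count.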

Aggregator transformation in Informatica

Is it compulsory to select a group-by port while performing a count operation in the Aggregator transformation in Informatica?
In order to perform a count per group, you have to specify at least one group-by column in the Aggregator transformation to let it know which column it has to group on.
Even if you don't provide a group-by, the mapping will not fail, but you may not get the expected result.
When using the Aggregator transformation, you need to check group-by so that aggregation is performed group by group and one row per group is passed to the pipeline. If no group-by is checked, only a single row (the last one) is returned, as the transformation has nothing to group on. So in order to perform a count with respect to a specific column, it is mandatory to check group-by for the required columns.
If you want to avoid grouping, you can use an Expression transformation with a variable port that increments per row, which emulates a count for the required column without grouping.
Thank you
It's not mandatory to select at least one port as group-by. However, if you don't choose any group-by port, Infa will return only the last row.
Hope this helps

How to Scan HBase Rows efficiently

I need to write a MapReduce job that gets all rows in a given date range (say, the last one month). It would have been a cakewalk had my row key started with the date, but my frequent HBase queries are on the leading values of the key.
My row key is exactly A|B|C|20120121|D, where the combination of A/B/C along with the date (in YYYYMMDD format) makes a unique row ID.
My HBase tables could have up to a few million rows. Should my mapper read the whole table and filter each row on the date range, or can a Scan with a Filter help handle this situation?
Could someone suggest a way (or a snippet of code) to handle this in an effective manner?
Thanks
-Panks
A RowFilter with a regex comparator would work, but would not be the optimal solution. Alternatively, you can try secondary indexes.
One more solution is to try the FuzzyRowFilter. A FuzzyRowFilter uses a kind of fast-forwarding, skipping many rows in the overall scan process, and will thus be faster than a plain RowFilter scan.
Alternatively, Bloom filters might also help, depending on your schema. If your data is huge, you should do a comparative analysis of secondary indexes and Bloom filters.
You can use a RowFilter with a RegexStringComparator. You'd need to come up with a RegEx that filters your dates appropriately. This page has an example that includes setting a Filter for a MapReduce scanner.
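For illustration, a sketch of such a filter for January 2012, assuming the A|B|C|date|D key layout from the question (the exact regex is an assumption about your key format; the classes come from org.apache.hadoop.hbase.filter):

Scan scan = new Scan();
// Match row keys whose fourth |-separated field is a date in 201201.
scan.setFilter(new RowFilter(CompareFilter.CompareOp.EQUAL,
        new RegexStringComparator("^[^|]*\\|[^|]*\\|[^|]*\\|201201\\d{2}\\|.*")));

Note that this still scans every row server-side; it only saves on what is shipped back to the client.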
I am just getting started with HBase, but Bloom filters might help.
You can modify the Scan that you send into the Mapper to include a filter. If your date is also the record timestamp, it's easy:
Scan scan = new Scan();
// Only cells whose timestamps fall within [minTime, maxTime) are returned.
scan.setTimeRange(minTime, maxTime);
TableMapReduceUtil.initTableMapperJob("mytable", scan, MyTableMapper.class,
        OutputKey.class, OutputValue.class, job);
If the date in your row key is different, you'll have to add a filter to your scan. The filter can operate on a column or on the row key; I think it's going to be messy with just the row key. If you put the date in a column, you can build a FilterList where all conditions must pass, using CompareOp.GREATER and CompareOp.LESS (or their _OR_EQUAL variants) for the range bounds, and then call scan.setFilter(filterList) to attach it to the scan.
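A minimal sketch of that FilterList approach, assuming the date lives in a column cf:date as a YYYYMMDD string (the family, qualifier, and bounds here are illustrative):

FilterList filterList = new FilterList(FilterList.Operator.MUST_PASS_ALL);
SingleColumnValueFilter lower = new SingleColumnValueFilter(
        Bytes.toBytes("cf"), Bytes.toBytes("date"),
        CompareFilter.CompareOp.GREATER_OR_EQUAL, Bytes.toBytes("20120101"));
SingleColumnValueFilter upper = new SingleColumnValueFilter(
        Bytes.toBytes("cf"), Bytes.toBytes("date"),
        CompareFilter.CompareOp.LESS_OR_EQUAL, Bytes.toBytes("20120131"));
// Without this, rows that lack the date column would pass the filter.
lower.setFilterIfMissing(true);
upper.setFilterIfMissing(true);
filterList.addFilter(lower);
filterList.addFilter(upper);
scan.setFilter(filterList);

Because YYYYMMDD strings sort lexicographically in date order, the byte-wise comparison gives the correct range.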