I am a beginner in Informatica and while studying incremental aggregation, I got stuck on this point:
We cannot use incremental aggregation when the mapping includes an Aggregator transformation with Transaction transformation scope. The Workflow Manager marks the session invalid.
I've searched a lot about it but did not get any answer. Please help.
The transformation scope you set in your mapping tells the Integration Service how the incoming records are grouped for processing, i.e. by row, by transaction, or across all input rows.
With a transformation scope of "Transaction", the Integration Service honors transaction boundaries. That means the caches used while processing a transaction are reset once a new transaction is received.
To use incremental aggregation, the Integration Service must preserve the aggregator cache between runs, which is not possible when the Transaction transformation scope resets the cache for every transaction.
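To make the conflict concrete, here is a minimal, purely illustrative Python sketch (not Informatica internals; the file name and structures are made up). The first function mimics incremental aggregation, which reuses a cache preserved from the previous run; the second mimics a Transaction-scoped aggregator, whose cache is thrown away at every transaction boundary, so nothing survives to aggregate incrementally:

# Illustrative only -- mimics the difference, not Informatica internals.
import os
import pickle

CACHE_FILE = "agg_cache.pkl"  # stands in for the aggregator cache files

def incremental_aggregate(new_rows):
    """Incremental aggregation: reuse the cache preserved from the previous run."""
    cache = pickle.load(open(CACHE_FILE, "rb")) if os.path.exists(CACHE_FILE) else {}
    for key, value in new_rows:                      # only the new source rows
        total, count = cache.get(key, (0, 0))
        cache[key] = (total + value, count + 1)
    pickle.dump(cache, open(CACHE_FILE, "wb"))       # preserved for the next run
    return cache

def transaction_scoped_aggregate(transactions):
    """Transaction scope: the cache is reset at every transaction boundary."""
    per_transaction = []
    for txn_rows in transactions:
        cache = {}                                   # reset -> nothing survives to reuse
        for key, value in txn_rows:
            total, count = cache.get(key, (0, 0))
            cache[key] = (total + value, count + 1)
        per_transaction.append(cache)                # aggregates exist per transaction only
    return per_transaction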
This is my use case:
I have a JSON API with 200k objects. The dataset looks a little something like this: date, bike model, production time in minutes. I use Lambda to read from the JSON API and write to DynamoDB via HTTP request. The Lambda function runs every day and updates DynamoDB with the most recent data.
I then retrieve the data by date since I want to calculate the average production time for each day and put it in a second table. An Alexa skill is connected to the second table and reads out the average value for each day.
First question: Since the same bike model is produced multiple times per day, using a composite primary key with date and bike model won't give me a unique key. Shall I create a UUID for the entries instead? Or is there a better solution?
Second question: For the calculation I would need to do a full table scan each time, which is very costly and advised against by many. How can I solve this problem without doing a full table scan?
Third question: Is it better to avoid DynamoDB altogether for my use case? Which AWS database is more suitable for my use case then?
Yes, a UUID or any other unique identifier (e.g. date + bike model + creation time) as the partition key is fine.
It seems your daily job for the average value is a data analytics job rather than a transactional one. I would suggest going with a service that supports data analytics, such as Amazon Redshift. You should be able to feed data into such a service using DynamoDB Streams. Alternatively, you can stream the data into S3 and use a service like Athena to compute the daily average.
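If you go the S3 + Athena route, the glue code could be as small as the sketch below, assuming a Lambda function subscribed to the table's DynamoDB Stream (with a NEW_IMAGE view type). The bucket name and attribute names (date, bike_model, production_time) are placeholders taken from your description, not anything fixed:

import json
import boto3

s3 = boto3.client("s3")
BUCKET = "bike-production-raw"  # hypothetical bucket later queried by Athena

def handler(event, context):
    """Triggered by the table's DynamoDB Stream; appends new records to S3 as JSON lines."""
    lines = []
    for record in event.get("Records", []):
        if record["eventName"] not in ("INSERT", "MODIFY"):
            continue
        image = record["dynamodb"]["NewImage"]        # DynamoDB-typed attributes
        lines.append(json.dumps({
            "date": image["date"]["S"],
            "bike_model": image["bike_model"]["S"],
            "production_time": float(image["production_time"]["N"]),
        }))
    if lines:
        key = "raw/" + context.aws_request_id + ".json"   # one object per invocation
        s3.put_object(Bucket=BUCKET, Key=key, Body="\n".join(lines))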
There is a simple database model that you could use for this task:
PartitionKey: a UUID, or any combination of fields that provides uniqueness.
SortKey: production date, as a string, e.g. 2020-07-28
If you then create a secondary index that uses the production date as its partition key and projects the production time, you can query (not scan) that index for a specific date and perform whatever calculations you need on the production time. With a global secondary index you can also provision the required read/write capacity on the index and the table independently.
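As a rough sketch with boto3, the daily average then becomes a Query against that index instead of a Scan. The table, index, and attribute names used here (bikes, date-index, production_date, production_time) are assumptions matching the schema above:

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("bikes")  # hypothetical table name

def average_production_time(day):
    """Query (not scan) the date index and average the production times for one day."""
    items, start_key = [], None
    while True:
        kwargs = {
            "IndexName": "date-index",                         # GSI keyed on the production date
            "KeyConditionExpression": Key("production_date").eq(day),
        }
        if start_key:
            kwargs["ExclusiveStartKey"] = start_key
        resp = table.query(**kwargs)
        items.extend(resp["Items"])
        start_key = resp.get("LastEvaluatedKey")               # follow pagination if needed
        if not start_key:
            break
    times = [float(i["production_time"]) for i in items]
    return sum(times) / len(times) if times else 0.0

# average_production_time("2020-07-28") -> daily average for that date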
Regarding your third question, I don't see any real benefit of using DynamoDB for this task. Any RDS engine (e.g. MySQL), Redshift, or even S3 + Athena can easily handle such a use case. If you require real-time analytics, you could even consider AWS Kinesis.
How does the Informatica PowerCenter platform work? For example, when we create a new user, what processing happens in the background? How is the source data extracted?
Are there any documents describing the background processes of Informatica PowerCenter?
This is an ETL tool, so it extracts data from the source, brings it into the Informatica server, applies transformations, and then pushes the data into the target system. All of this is done using multiple threads; you may not see them in the documentation, but you can see them in the session logs.
For a simple SRC > EXP > AGG > TGT mapping, I can explain the back-end process.
Step 1 - Informatica gets the mapping info from the repository.
Step 2 - The Informatica service creates three threads:
a. Read - reads from the source and loads the data into Informatica memory.
b. Transform - aggregates the data once (a) is done and puts the result into memory.
c. Load - writes the data to the target once (b) is done.
This can get more complex when you have many transformations. Everything is logged in the session logs, and you will get a great deal of information from them.
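As a rough analogy only (this is not PowerCenter code), the reader/transformer/writer pipeline behaves like three threads handing data off through in-memory buffers:

# Rough analogy of the reader/transformer/writer threads; purely illustrative.
import queue
import threading

source_q, target_q = queue.Queue(), queue.Queue()
STOP = object()  # end-of-data marker

def reader(rows):
    for row in rows:                                  # a. read from source into memory
        source_q.put(row)
    source_q.put(STOP)

def transformer():
    totals = {}
    while (row := source_q.get()) is not STOP:
        key, value = row
        totals[key] = totals.get(key, 0) + value      # b. aggregate in memory
    for item in totals.items():
        target_q.put(item)
    target_q.put(STOP)

def writer(target):
    while (item := target_q.get()) is not STOP:
        target.append(item)                           # c. load into the target

target_rows = []
threads = [
    threading.Thread(target=reader, args=([("A", 10), ("A", 5), ("B", 7)],)),
    threading.Thread(target=transformer),
    threading.Thread(target=writer, args=(target_rows,)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(target_rows)  # e.g. [('A', 15), ('B', 7)]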
Informatica PowerCenter offers the capability to connect to and fetch data from different heterogeneous sources and to process that data. For example, you can connect to a SQL Server or Oracle database and integrate the data into a third system. Informatica has its own transformation language that you can use in the Expression transformation, Filter transformation, Source Qualifier transformation, etc. It is quite versatile and not at all difficult to learn if you're familiar with any of the most popular programming languages of today.
I have a scenario whereby, if part of a query matches an event, I want to fetch some other events from a datastore to test against the rest of the query.
e.g. "If JANE DOE buys from my store, did she buy anything else over the last 3 years?" sort of thing.
Does Flink, Storm or WSO2 provide support for such complex event processing?
Flink can do this, but it would require that you process all events starting from the earliest that you care about (e.g. 3 years ago), so that you can construct the state for each customer. Flink then lets you manage this state (typically with RocksDB) so that you wouldn't have to replay all the events in the face of system failures.
If you can't replay all of the history, then typically you'd put this into some other store (Cassandra/HBase, Elasticsearch, etc) with the scalability and performance characteristics you need, and then use Flink's async function support to query it when you receive a new event.
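Conceptually, that "enrich each event by querying an external store" pattern looks like the plain Python sketch below. This is not Flink's AsyncFunction API, and lookup_purchases stands in for whatever client call your chosen store provides:

from datetime import datetime, timedelta

def lookup_purchases(customer_id, since):
    """Hypothetical client call against the external store (Cassandra, HBase, Elasticsearch, ...)."""
    raise NotImplementedError  # replace with the real query for your store

def on_purchase_event(event):
    """For each incoming purchase, pull the customer's last 3 years of purchases
    so the rest of the query can be evaluated against that history."""
    cutoff = datetime.utcnow() - timedelta(days=3 * 365)
    history = lookup_purchases(event["customer_id"], since=cutoff)
    return {"event": event, "previous_purchases": history or []}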
WSO2 Stream Processor lets you implement such functionality with its incremental time-based analytics feature. To implement the scenario you've mentioned, you can feed the events that are triggered when a customer arrives into a construct called an 'aggregation'. As you keep feeding events to an aggregation, it summarizes the data over time and saves it to a configured persistence store such as a database.
You can query this aggregation to get the state for a given period of time. For example, the following query fetches the name, total items bought, and average transaction value for the year 2014-2015:
from CustomerSummaryRetrievalStream as b join CustomerAggregation as a
on a.name == b.name
within "2014-01-01 00:00:00 +05:30", "2015-01-01 00:00:00 +05:30"
per "years"
select a.name, a.total, a.avgTxValue
insert into CustomerSummaryStream;
We have sysdig running on our WSO2 API Gateway machine and we notice that it fires a large number of SQL queries to the database for a minute, then waits a minute and repeats.
The query looks like this - every minute the gateway fires a burst of requests of the following format, waits for a minute, and starts again:
SELECT REG_PATH, REG_USER_ID, REG_LOGGED_TIME, REG_ACTION, REG_ACTION_DATA
FROM REG_LOG
WHERE REG_LOGGED_TIME>'2016-02-29 09:57:54'
AND REG_LOGGED_TIME<'2016-03-02 11:43:59.959' AND REG_TENANT_ID=-1234
There is no load on the server. What is causing this? What can we do to avoid this?
This particular query is the result of the registry indexing task that runs in the background. The REG_LOG table is queried periodically to retrieve the latest registry actions. The indexing task cannot be stopped. However, you can configure the frequency of the indexing task through the following parameter in registry.xml. See [1] for more information.
indexingFrequencyInSeconds
If this table has filled up, you can clean the data using a simple SQL query. However, when deleting the records, be careful not to delete all the data. The latest record for each resource path should be left in the REG_LOG table, since reindexing requires at least one reference to each resource path.
Also, if required, before clearing up the REG_LOG table you can take a dump of the data in case you do not want to lose old records. Hope this answer provides the information you require.
[1] - https://docs.wso2.com/display/Governance510/Configuration+for+Indexing
I have one query:
I am using ESB 4.7.0 and DSS 3.0.1.
I wish to insert data reliably into a database. For that, I receive an array list from the client.
I need to insert that array into 3 different tables. Each table returns a generated key, which is needed for the insert into the 2nd table; the same process applies to the 3rd table.
For this I am using 3 different insertion operations in WSO2 ESB via WSO2 DSS, and the insertion itself is working nicely.
My issue is that while inserting into the 2nd or 3rd table, an error may occur due to a network issue or a data-related issue.
In that case my transaction should be rolled back. I have tried the transaction mediator, but it only helps within a single sequence; it does not extend to other sequences. So how could I achieve this? Should I use a class mediator or something else?
The transaction mediator is designed to cater to the atomicity requirement. Since you are using insertions only, without any deletions, you could pass the primary key of the entry inserted into the first table to a class mediator and delete it on failure, but I think atomicity would still not be guaranteed. Therefore, a true transaction is not achieved this way.
Since you are using three different operations, you can use the DSS boxcarring feature along with the Query Request Export feature, which enables you to do transactions in a coordinated way. Please refer to the documentation to see how you can use the boxcarring feature. It allows individual queries executed in a boxcarring session to communicate with each other. The concept is 'exporting' a specific result element so that the next query in the session receives that element as a query parameter. So, if you have two queries, 'query1' and 'query2', that are executed sequentially in a boxcarring session, and 'query1' has a result element exported with the name 'foo', then 'query2' gets a query parameter named 'foo'. When the boxcarring session is executed, query1's exported value is passed into query2 as an input parameter.
For your requirement, the ideal solution is to use boxcarring. Boxcarring is a method of grouping a set of service calls together and executing them at once. Where applicable, a boxcarring session works in a transactional manner, such as when used with an RDBMS data source.
The 'Data Service Hosting' feature facilitates boxcarring by grouping service calls in the server side. As a result, special service clients are not required and as usual, successive service calls can be made to the server to participate in a boxcarring session.
For boxcarring to function, a transport that supports session management, such as HTTP, must be used. The service client should also support session management by returning back session cookies when sent by the server. Axis2 Service Clients have full support for session management.
Please refer to the original WSO2 documentation on boxcarring and this useful blog post, which explains how to work with boxcarring step by step.