How does the Informatica PowerCenter platform work? For example, when we create a new user, what background processing takes place? How is the source data extracted?
Is there any documentation that describes the background processes of Informatica PowerCenter?
This is an ETL tool, so it extracts data from the source, brings it into the Informatica server, performs the transformations, and then pushes the data into the target system. All of this is done with multiple threads; you may not see them described in the documentation, but you can see them in the session logs.
For a simple SRC > EXP > AGG > TGT mapping I can explain the back-end process.
Step 1 - Informatica gets the mapping info from the repository.
Step 2 - The Informatica service creates three threads:
a. Read - reads from the source and loads the rows into Informatica's memory.
b. Transform - aggregates the data as (a) produces it and puts the result into memory.
c. Load - writes the data to the target once (b) is done.
This gets more complex when you have many transformations. Everything is written to the session logs, and you will get tons of information from them.
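This is not Informatica's actual code, of course, but a minimal Java sketch of the read / transform / load hand-off described above, with made-up rows and a simple count standing in for the aggregation, just to make the thread pattern concrete:

import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

public class SrcExpAggTgtSketch {
    private static final BlockingQueue<String> buffer = new LinkedBlockingQueue<>(1000);
    private static final String EOF = "__EOF__"; // marks the end of the source data

    public static void main(String[] args) throws InterruptedException {
        Map<String, Integer> aggregated = new ConcurrentHashMap<>();

        // a. Read thread - reads from the "source" and loads rows into memory.
        Thread reader = new Thread(() -> {
            try {
                for (String row : List.of("A", "B", "A", "C", "B", "A")) {
                    buffer.put(row);
                }
                buffer.put(EOF);
            } catch (InterruptedException ignored) { }
        });

        // b. Transform thread - aggregates the rows as the reader produces them.
        Thread transformer = new Thread(() -> {
            try {
                for (String row = buffer.take(); !row.equals(EOF); row = buffer.take()) {
                    aggregated.merge(row, 1, Integer::sum);
                }
            } catch (InterruptedException ignored) { }
        });

        reader.start();
        transformer.start();
        reader.join();
        transformer.join();

        // c. Load step - pushes the aggregated result to the "target" once (b) is done.
        aggregated.forEach((key, count) -> System.out.println(key + " -> " + count));
    }
}

In the real product the number and kind of threads depends on the mapping, which is exactly why the session log is the place to look.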
Informatica PowerCenter offers the capability to connect to and fetch data from different heterogeneous sources and to process that data. For example, you can connect to a database such as SQL Server or Oracle and integrate the data into a third system. Informatica has its own transformation language that you can use in the Expression transformation, Filter transformation, Source Qualifier transformation, etc. It is quite versatile and not at all difficult to learn if you're familiar with any of today's popular programming languages.
I'm trying to establish if my planned way of working is correct.
I have two data sources: a MySQL and an MSSQL database. I need to combine these data sources and expose the data for Power BI to consume.
I've decided to use Azure Synapse Analytics for the ETL and would like to understand if there is anything in the process I can simplify or do better.
The process is as follows:
MySQL & MSSQL deltas are loaded into ASA in Parquet format and stored in Azure Data Lake Storage Gen2.
Once the copy pipeline is complete, a subsequent data flow unions the data from the two sources and inserts it into MSSQL storage in ASA.
Power BI consumes from this workspace / data source.
I'm not sure whether I should be staging the data from the sources into ADLS Gen2, or whether I should just perform the transform and insert straight from the sources into the MSSQL storage. Any thoughts or suggestions would be greatly appreciated.
The pattern that you're following is the data lake pattern, where data is moved between 3 zones:
Raw
Enriched
Curated
The Raw zone keeps an original copy of the data before transformation. The benefit of storing the data this way (as Parquet files, here) is that you can troubleshoot a problem with the transformation or create a different transformation to address a new need.
The Enriched zone is where you have done some transformation, like UNIONing your data, or providing some other clean-up steps, maybe removing unneeded columns, correcting addresses, etc. You have done this by inserting the data into a SQL database, but it might also be accomplished by using views in the serverless pool, if the transformations are simple enough (see the sketch at the end of this answer): https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/create-use-views
The Curated zone is a place to transform your data into a form that BI applications will do well with, i.e. a star schema. Even if this is a very simple dataset, it will be well worth incorporating a date dimension, which will yield a lot of benefits in Power BI. The bottom line here is that Power BI is optimized to work with star schemas, so that's what you should give it.
You do not need to use data lake technologies to follow this pattern and still get the benefits. As for whether what you are doing is good, that will come down to how everything performs versus how simple you can keep it.
Here's more on the topic: https://learn.microsoft.com/en-us/azure/cloud-adoption-framework/scenarios/cloud-scale-analytics/best-practices/data-lake-overview
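If the transformations really are just a UNION, a hedged sketch of the serverless-pool view mentioned under the Enriched zone could look like the statement below. The storage account, container, paths, workspace endpoint and credentials are all placeholders, and the workspace identity is assumed to have access to the storage; you could also paste the embedded SQL straight into Synapse Studio instead of running it from Java with the Microsoft SQL Server JDBC driver:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateUnionView {
    public static void main(String[] args) throws Exception {
        // Placeholder serverless SQL endpoint of the Synapse workspace.
        String url = "jdbc:sqlserver://myworkspace-ondemand.sql.azuresynapse.net:1433;"
                + "database=mydb;encrypt=true;user=sqladminuser;password=<secret>";

        // A view over the raw Parquet files that unions the two sources.
        String createView =
                "CREATE VIEW dbo.CombinedOrders AS "
              + "SELECT * FROM OPENROWSET("
              + "    BULK 'https://mystorage.dfs.core.windows.net/raw/mysql/orders/*.parquet',"
              + "    FORMAT = 'PARQUET') AS mysql_rows "
              + "UNION ALL "
              + "SELECT * FROM OPENROWSET("
              + "    BULK 'https://mystorage.dfs.core.windows.net/raw/mssql/orders/*.parquet',"
              + "    FORMAT = 'PARQUET') AS mssql_rows";

        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement()) {
            stmt.execute(createView);
        }
    }
}

Power BI (or the Curated-zone transformation) can then read dbo.CombinedOrders from the serverless endpoint.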
"Once the copy pipeline is complete, a subsequent data flow unions the data from the two sources and inserts it into MSSQL storage in ASA"
What is the use of the MSSQL storage? Is it only used by Power BI to create reports? If yes, then you can use ADLS Gen2 instead, as it will be cheaper (basically very much in line with what Mark said above about the "Curated" zone).
Just one more thing to consider: Power BI can read data from both sources and then do the transformation within itself.
Is there anything (system procedure, function, or other) in SQL Server that will provide the functionality of Oracle's DBMS_ALERT package (and DBMS_PIPE, respectively)?
I work in a plant and I'm using an extension product of SQL Server called InSQL Server by Wonderware, which is specialized in gathering data from plant controllers and Human-Machine Interface (SCADA) software.
This system can record events happening in the plant (like a high-temperature alarm, for example). It stores sensor values in extension tables of SQL Server, and other less dense information in normal SQL Server tables.
I want to be able to alert some applications running on operator PCs that an event has been recorded in the database.
An AFTER INSERT trigger on the events table seems to be a good place to put something equivalent to DBMS_ALERT (if it exists), to wake up other applications that are waiting for the specific alert and have the operators type in some data.
In other words - I want to be able to notify other processes (that have connection to SQL Server) that something has happened in the database.
All Wonderware (InSQL, but now called Aveva) Historian data is stored in the history blocks EXCEPT for the actual tag storage configuration and dedicated event data. The time-series data for analog, discrete and string tags is NOT in SQL tables at all, unless someone has done custom configuration to create tables of their own.
Where do you want these notifications to come up? Even though the historical data is NOT stored in SQL tables, Wonderware has extensive documentation on how to use SQL queries to retrieve the data appropriately (check for whatever condition you are looking for).
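As an illustration of that query-based approach (NOT the real Historian schema; the connection string, table and columns below are placeholders, so substitute the documented Wonderware queries), a small Java poller on the operator PC could check for new events on a schedule and react when rows come back. Note this is plain polling, not a push mechanism like DBMS_ALERT:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Timestamp;

public class EventPoller {
    public static void main(String[] args) throws Exception {
        // Placeholder connection string -- point it at the Historian's SQL Server instance.
        String url = "jdbc:sqlserver://insqlhost:1433;databaseName=Runtime;user=<user>;password=<secret>";
        Timestamp lastSeen = new Timestamp(System.currentTimeMillis());

        try (Connection conn = DriverManager.getConnection(url)) {
            while (true) {
                // Placeholder condition query -- replace with the query from the Wonderware
                // docs that matches the event/condition you care about.
                try (PreparedStatement ps = conn.prepareStatement(
                        "SELECT EventTime, Description FROM dbo.MyEvents WHERE EventTime > ? ORDER BY EventTime")) {
                    ps.setTimestamp(1, lastSeen);
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            lastSeen = rs.getTimestamp("EventTime");
                            // Replace with whatever should happen on the operator PC.
                            System.out.println("New event: " + rs.getString("Description"));
                        }
                    }
                }
                Thread.sleep(5_000);
            }
        }
    }
}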
You can easily build a stored procedure and configure it for a maintenance plan.
But are you just trying to alarm (provide a notification) on the SCADA itself?
Or are you truly utilizing historical data (looking for a data trend, average, etc.)?
Or trying to send the notification to non-SCADA interfaces?
Depending on your specific answer, the SCADA itself should probably be able to do it.
But there is software that already does this type of thing: Win-911, SeQent, and Scadatec are a few in the OT space, but also things like Hip Link or even DeskAlert, which can connect to any SQL database via their own APIs.
So where does the info need to go (email, text, phone, desktop app...) and what is the real source of the data?
The Presto website (and other docs) talk about "interactive queries" on Presto. What is an "interactive query"? From the Presto Website: "Facebook uses Presto for interactive queries against several internal data stores, including their 300PB data warehouse."
An interactive query system is basically a user interface that translates the user's input into SQL queries. These are then sent to Presto, which processes the queries, fetches the data, and sends it back to the user interface.
The UI then renders the output, which is typically NOT just a simple table of numbers and text, but rather a complex chart, a diagram or some other powerful visualization.
The user expects to be able to, for example, update one criterion and get the updated chart or visualization in near real time, just as you would expect of any application, even if producing that analysis involves processing LOTS of data.
Presto can do that because it can query massive distributed storage systems like HDFS and many cloud object stores, as well as RDBMSs and so on. It can also be set up with a huge cluster of workers that query the sources in parallel, and therefore process massive amounts of data for analysis while still being fast enough to meet the user's expectations.
A typical application to use for the visualization is Apache Superset. You just hook up Presto to it via the JDBC driver. Presto has to be configured to point at the underlying data sources and you are ready to go.
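As a rough illustration of that JDBC hookup (the coordinator host, catalog, schema and table are placeholders, and the presto-jdbc driver is assumed to be on the classpath), the kind of aggregate query a tool like Superset fires is just an ordinary JDBC call:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;

public class PrestoJdbcExample {
    public static void main(String[] args) throws Exception {
        // Placeholder coordinator host, catalog and schema -- adjust to your cluster.
        String url = "jdbc:presto://presto-coordinator.example.com:8080/hive/default";
        Properties props = new Properties();
        props.setProperty("user", "analyst"); // Presto requires a user name on the connection

        try (Connection conn = DriverManager.getConnection(url, props);
             Statement stmt = conn.createStatement();
             // The sort of aggregation a dashboard filter change would trigger.
             ResultSet rs = stmt.executeQuery(
                     "SELECT region, count(*) AS orders FROM orders GROUP BY region")) {
            while (rs.next()) {
                System.out.println(rs.getString("region") + " -> " + rs.getLong("orders"));
            }
        }
    }
}

The "interactive" part is simply that Presto answers such queries fast enough for the UI to re-render the chart each time the user changes a criterion.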
First, let me introduce our use case:
A real-time data analytics platform.
An external system produces time-series data every second. The time-series data consists of [id, time, value] fields, and the system provides a REST API to search this time-series data.
We have a lot of (more than 100) standalone CPP programs that analyze the time-series data and write KPIs into a database. The programs do real-time computation: every CPP program reads data each second, processes it each second, and sends the KPI result to the database.
The system architecture we use is quite simple:
But the system has problems:
Every second, the external system receives a huge number of HTTP requests, which causes a decline in its performance.
The DB has the same problem as in (1).
We have no suitable tools to manage the CPP programs; we don't know when and why they crash. We want to receive an alert when they have any problem.
Lacking suitable tools, we can only deploy and start the CPP programs one by one.
Many programs request the same time-series data; for example, program A requests [ID1, ID2, ID3, ID4], program B may request [ID2, ID4, ID6, ID7], and program C may request [ID3, ID5, ID5, ID7], so a large amount of duplicate data appears across different requests.
After some investigation, we think the WSO2 products are the best choice to solve our problems, and we changed the architecture:
We use DAS to fetch the time-series data, dispatch the data, and collect the KPI results. GREG is used to manage the CPP programs' lifecycle.
In GREG
Define a new artifact type that holds the CPP program (.exe or script). We want to use the publisher web console to publish new CPP programs and
manage the program lifecycle (start/stop/pause/reset), but this is still in development, and we cannot fully confirm that it can be achieved.
We want to upload the CPP program file to the Enterprise Store, and users can subscribe to it from the GREG publisher.
Monitor every CPP program.
In DAS
Create a custom receiver that gets the ID list from GREG every 30s and fetches the time-series data from the external system.
Create the stream, which persists the event data.
Create the execution plan, which uses Siddhi to reorganize the time-series data for each CPP program.
Create an HTTP receiver to receive the KPI results from the CPP programs.
Create the publisher to send the KPIs to the external DB store.
So, are there any problems with our architecture? And is this the best way to use DAS and GREG?
Thanks for any suggestions.
We have 4 data sources. Two are internal, and we can connect directly to their databases. For the 3rd data source we get a flat file (.csv) and have to pull in the data. The 4th data source is external, and we cannot access it directly.
We need to pull data from all 4 data sources, run business rules on it, and store it in our database. We have a web application that runs on top of this database. Also, every month we have to pull the data again and apply any updates/deletes/adds to the existing data.
I am pretty much ignorant about this process. Also, can you please point me to some good books on this topic?
These are the approaches I am currently thinking of:
Write an internal web service that talks to the internal data sources and pulls the data. Create a handler for the external data source using middleware (MQSeries is already set up for this in another existing project; I plan to reuse that). Pull the data from the CSV file, again using Java.
Run the business rules on this data from Java, then use the data.
This approach might work on my dev box, but I'm not sure what problems could occur in prod (especially due to synchronization).
Pull the data from the internal sources using plain Java JDBC connections. For the remaining two, get flat files and dump the data using SQL*Loader. All the data goes to temporary (staging) tables first; then run the business rules through PL/SQL and use the result (a sketch of this appears after the list).
Use an ETL tool like Informatica to pull the data, and write the business rules in Perl (invoked by Informatica).
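Here is the sketch referenced above for the plain-JDBC-into-staging-tables option. The JDBC URLs, credentials and table/column names are placeholders, and the business rules would still run in PL/SQL afterwards:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class StagingLoad {
    public static void main(String[] args) throws Exception {
        // Placeholder connections: one to an internal source, one to the schema
        // that holds the temporary (staging) tables.
        try (Connection src = DriverManager.getConnection(
                     "jdbc:oracle:thin:@//internal-db.example.com:1521/SRC", "src_user", "<secret>");
             Connection stg = DriverManager.getConnection(
                     "jdbc:oracle:thin:@//dwh.example.com:1521/DWH", "stg_user", "<secret>")) {

            stg.setAutoCommit(false);
            try (Statement read = src.createStatement();
                 ResultSet rs = read.executeQuery("SELECT id, name, amount FROM orders");
                 PreparedStatement write = stg.prepareStatement(
                         "INSERT INTO stg_orders (id, name, amount) VALUES (?, ?, ?)")) {
                while (rs.next()) {
                    write.setLong(1, rs.getLong("id"));
                    write.setString(2, rs.getString("name"));
                    write.setBigDecimal(3, rs.getBigDecimal("amount"));
                    write.addBatch();
                }
                write.executeBatch();
            }
            stg.commit();
            // Business rules then run as PL/SQL against stg_orders, e.g. a MERGE
            // into the final tables to handle the monthly updates/deletes/adds.
        }
    }
}

The CSV source can go through the same staging tables via SQL*Loader, so every source ends up being processed by the same PL/SQL rules.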
Thanks.
A book like "The Data Warehouse ETL Toolkit" by Ralph Kimball is a good resource for learning techniques/architectures to bring data from different sources into one place.