First, let me introduce our use case: a real-time data analytics platform.
An external system produces time-series data every second. Each record consists of [id, time, value] fields, and the system exposes a REST API to query this data.
We have many (more than 100) standalone C++ programs that analyze the time-series data and write KPI results to a database. These programs do real-time computation: every second each program reads the latest data, processes it, and sends its KPI result to the database.
The system architecture we use is very simple:
But the system has problems:
Every second the external system receives a huge number of HTTP requests, which degrades its performance.
The database suffers the same problem as (1).
We have no suitable tools to manage the C++ programs; we don't know when or why they crash. We want to receive an alert whenever any of them has a problem.
Lacking suitable tooling, we can only deploy and start the C++ programs one by one.
Many programs request the same time-series data. For example, program A requests [ID1, ID2, ID3, ID4], program B may request [ID2, ID4, ID6, ID7], and program C may request [ID3, ID5, ID7], so a large amount of duplicate data appears across the different requests (see the sketch below).
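To make the duplication concrete, here is a tiny illustration (the program names and IDs are just the example above):

```python
# Each program fetches its own ID list, so shared IDs are downloaded repeatedly.
requests_per_program = {
    "A": {"ID1", "ID2", "ID3", "ID4"},
    "B": {"ID2", "ID4", "ID6", "ID7"},
    "C": {"ID3", "ID5", "ID7"},
}

total_fetched = sum(len(ids) for ids in requests_per_program.values())  # 11 fetches per second
unique_ids = set().union(*requests_per_program.values())                # only 7 distinct IDs

print(f"{total_fetched} fetches per second, but only {len(unique_ids)} distinct IDs")
# Fetching the union of IDs once and dispatching locally would remove the duplicates.
```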
After some investigation, we think the WSO2 products are the best choice to solve our problem, so we changed the architecture:
We use DAS to fetch the time-series data, dispatch it, and collect the KPI results. GREG is used to manage the lifecycle of the C++ programs.
In GREG
Define a new artifact type that holds the C++ program (.exe or script). We want to use the Publisher web console to publish new C++ programs
and manage the program lifecycle (start/stop/pause/reset). This part is still in development, and we cannot yet fully confirm that it can be achieved.
We want to upload the C++ program file to the Enterprise Store, so users can subscribe to it from the GREG publisher.
Monitor every C++ program.
In DAS
Create a custom receiver that gets the ID list from GREG every 30 s and fetches the time-series data from the external system (a rough sketch of this polling logic follows this list).
Create a stream that persists the event data.
Create an execution plan that uses Siddhi to reorganize the time-series data for each C++ program.
Create an HTTP receiver to receive the KPI results from the C++ programs.
Create a publisher to send the KPIs to the external database.
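Here is the polling logic we have in mind for the custom receiver, sketched in Python purely for illustration (this is not actual DAS receiver code, and the GREG and external-system URLs are placeholders):

```python
import time
import requests  # any HTTP client would do

GREG_ID_LIST_URL = "http://greg.example.com/registry/id-list"    # placeholder
TS_SEARCH_URL = "http://external.example.com/api/timeseries"     # placeholder

def dispatch_to_streams(events):
    ...  # hand the events to the DAS stream / execution plan (details depend on DAS)

def poll_forever():
    ids = []
    last_id_refresh = 0.0
    while True:
        now = time.time()
        # Refresh the ID list from GREG every 30 s.
        if now - last_id_refresh >= 30:
            ids = requests.get(GREG_ID_LIST_URL, timeout=5).json()
            last_id_refresh = now
        # One batched request per second for the union of all IDs,
        # instead of one request per C++ program.
        batch = requests.get(TS_SEARCH_URL, params={"ids": ",".join(ids)}, timeout=5).json()
        dispatch_to_streams(batch)
        time.sleep(1)
```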
So, are there any problems with our architecture? And is this the best way to use DAS and GREG?
Thanks for any suggestions.
Is there anything (system procedure, function, or other) in SQL Server that provides the functionality of Oracle's DBMS_ALERT package (and DBMS_PIPE, respectively)?
I work in a plant and I'm using an extension product of SQL Server called InSQL Server by Wonderware, which specializes in gathering data from plant controllers and human-machine interface (SCADA) software.
This system can record events happening in the plant (like a high-temperature alarm, for example). It stores sensor values in extension tables of SQL Server, and other less dense information in normal SQL Server tables.
I want to be able to alert some applications running on operator PCs that an event has been recorded in the database.
An AFTER INSERT trigger on the events table seems like a good place to put something equivalent to DBMS_ALERT (if it exists), to wake up other applications that are waiting for a specific alert and have the operators type in some data.
In other words - I want to be able to notify other processes (that have connection to SQL Server) that something has happened in the database.
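For illustration, the effect I'm after is roughly what this naive polling loop does (Python/pyodbc, with hypothetical table and column names), except that I want a push-style mechanism instead of a loop like this:

```python
import time
import pyodbc  # assumes a SQL Server ODBC driver is installed

def notify_operator_app(tag, text):
    # Stand-in for whatever wakes up the operator application.
    print(f"ALERT for {tag}: {text}")

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=plantsql;DATABASE=plant;Trusted_Connection=yes;"
)
last_seen_id = 0

while True:
    # Hypothetical events table; ideally the AFTER INSERT trigger would push instead of us polling.
    rows = conn.cursor().execute(
        "SELECT EventId, TagName, EventText FROM dbo.OperatorEvents WHERE EventId > ?",
        last_seen_id,
    ).fetchall()
    for event_id, tag, text in rows:
        notify_operator_app(tag, text)
        last_seen_id = max(last_seen_id, event_id)
    time.sleep(2)
```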
All Wonderware (InSQL, now called AVEVA) Historian data is stored in the history blocks EXCEPT for the actual tag storage configuration and dedicated event data. The time-series data for analog, discrete and string tags is NOT in SQL tables at all - unless someone is doing custom configuration to create tables of their own.
Where do you want these notifications to come up? Even though the historical data is NOT stored in SQL tables, Wonderware has extensive documentation on how to use SQL queries to retrieve the data appropriately (and check for whatever condition you are looking for).
You can easily build a stored procedure and configure it for a maintenance plan.
But are you just trying to alarm (provide notification) on the scada itself?
Or are you truly utilizing historical data (looking for a data trend - average, etc.)?
Or trying to send the notification to non-scada interfaces?
Depending on your specific answer, the scada itself should probably be able to do it.
But there is software that already does this type of thing: Win-911, SeQent and Scadatec are a couple in the OT space. There are also things like Hip Link or even DeskAlert, which can connect to any SQL database via its own API.
So where does the info need to go (email, text, phone, desktop app...) and what is the real source of the data?
How does the Informatica PowerCenter platform work? For example, when we create a new user, what background processing takes place? How does the source data get extracted?
Are there any documents that describe the background processes of Informatica PowerCenter?
This is an ETL tool, so it extracts data from the source, gets it into the Informatica server, does transformations, then pushes the data into the target box. All of this is done using multiple threads; you may not be able to see them in the documentation, but you can see them in the session logs.
For a simple SRC > EXP > AGG > TGT mapping I can explain the back-end process.
Step 1 - Infa gets the mapping info from the repository.
Step 2 -
The Infa service creates three threads -
a. Read - reads from the source and loads the data into Infa memory.
b. Transform - aggregates the data when #1 is done and puts the result into memory.
c. Load - loads the data to the target when #2 is done.
This can get complex when you have many transformations. Everything is logged in the session logs, and you will get tons of info from them.
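Just to visualise that pattern, here is a toy analogy of the read/transform/load threads (this is not Informatica internals, just the general producer/consumer shape):

```python
import queue
import threading

raw_q, agg_q = queue.Queue(), queue.Queue()

def read():
    # Reader thread: pulls rows from the "source" into memory.
    for row in [("A", 10), ("A", 5), ("B", 7)]:   # stand-in for a source table
        raw_q.put(row)
    raw_q.put(None)                                # end-of-data marker

def transform():
    # Transformer thread: aggregates rows as they arrive (like an AGG transformation).
    totals = {}
    while (row := raw_q.get()) is not None:
        key, value = row
        totals[key] = totals.get(key, 0) + value
    agg_q.put(totals)

def load():
    # Writer thread: pushes the aggregated result to the "target".
    print("loaded to target:", agg_q.get())

threads = [threading.Thread(target=f) for f in (read, transform, load)]
for t in threads: t.start()
for t in threads: t.join()
```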
Informatica PowerCenter offers the capability to connect to and fetch data from different heterogeneous sources and to process that data. For example, you can connect to a SQL Server database or an Oracle database and integrate the data into a third system. Informatica has its own transformation language that you can use in your expression transformation, filter transformation, source qualifier transformation, etc. It is quite versatile and not at all difficult to learn if you're familiar with any of today's most popular programming languages.
Context
I'm making a Django web application that depends on scraped API data.
The workflow:
A) I retrieve data from external API
B) Insert structured, processed data that I need in my PostgreSQL database (about 5% of the whole JSON)
I would like to add a third step (before or after step B) which will store the whole external API response in my database, for three reasons:
I want to "freeze" the data as an "audit trail", in case the API changes the content (it has happened before).
API calls in my business are expensive, and often limited to 6 months of history.
I might decide to integrate more data from the API later.
Calling the external API again when data is needed is not possible because of 2) and 3)
Please note that the stored API responses will never be updated and read performance is not really important. Also, being able to query the stored API responses would be really nice, to perform exploratory analysis.
To provide additional context, there are a few thousand API calls a day, which represents around 50 GB of data a year.
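For illustration, the kind of "raw response" table I have in mind looks roughly like this (the model and field names are placeholders; this assumes Django 3.1+ with its built-in JSONField on PostgreSQL):

```python
from django.db import models

class RawApiResponse(models.Model):
    """Immutable copy of one external API response (never updated after insert)."""
    endpoint = models.CharField(max_length=255)           # which API call this came from
    fetched_at = models.DateTimeField(auto_now_add=True)  # when it was scraped
    payload = models.JSONField()                          # the full, untouched JSON body

    class Meta:
        indexes = [models.Index(fields=["endpoint", "fetched_at"])]
```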
Here comes my question(s)
Should I store the raw JSON in the same PostgreSQL database I'm using for the Django web application, or in a separate datastore (MongoDB or some other NoSQL database)?
If I go with storing the raw JSON in my PostgreSQL database, I fear that my web application's performance will decrease due to the database being "bloated" (50 MB of parsed SQL data in my Django database corresponds to 2 GB of raw JSON from the external API, so integrating the full API responses into my database will multiply its size by 40).
What about cost? All of this is hosted on a DBaaS. I understand that the cost will increase greatly (due to the increase in database size), but is either of the two options more cost-effective?
I have around a million data items being inserted into BigQuery using the streaming API (the BigQuery Python client's insert_row function), but there is some data loss: ~10,000 data items are lost during insertion. Is there a chance BigQuery might be dropping some of the data? There aren't any insertion errors (or any errors whatsoever, for that matter).
I would recommend filing a private Issue Tracker issue so the BigQuery engineers can look into this. Make sure to provide the affected project, the source of the data, the code you're using to stream into BigQuery, and the client library version.
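Before filing, it is also worth double-checking that the return value of the streaming insert call is actually inspected: the Python client reports per-row insert failures in the value it returns rather than raising an exception for them. A minimal sketch (the table ID and row contents are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.my_table"   # placeholder

rows = [{"id": 1, "value": 42.0}, {"id": 2, "value": 17.5}]

# insert_rows_json returns one entry per failed row; an empty list means all rows were accepted.
errors = client.insert_rows_json(table_id, rows)
if errors:
    print("Some rows failed to insert:", errors)   # log and/or retry these rows
else:
    print("All rows inserted.")
```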
We have 4 data sources. Two of them are internal, and we can connect directly to their databases. For the 3rd data source we get a flat file (.csv) and have to pull in the data. The 4th data source is external and we cannot access it directly.
We need to pull data from all 4 data sources, run business rules on them, and store the results in our database. We have a web application that runs on top of this database. Also, every month we have to pull the data again and apply any updates/deletes/adds to the existing data.
I am pretty much ignorant about this process. Also, can you please point me to some good books on this topic?
These are the approaches I was currently thinking of:
Write an internal web service that talks to the internal data sources and pulls data. Create a handler for the external data source using middleware (MQSeries is already set up for this in another existing project, and I'm planning to reuse that). Pull data from the CSV file using Java as well.
Run some business rules on this data in Java, then use it.
This approach might run on my dev box, but I'm not sure what problems could occur in production (especially due to synchronization).
Pull data from the internal sources using plain Java JDBC connections. For the remaining two, get flat files and load the data using SQL*Loader. All the data goes into staging tables first. Run the business rules through PL/SQL and then use the data (a rough sketch of this flow follows the list).
Use some ETL tool like Informatica to pull the data, and write the business rules in Perl (invoked by Informatica).
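Roughly what I picture for the second approach, sketched in Python just to show the flow (the table names and the stored procedure are hypothetical; in practice it would be Java/JDBC plus SQL*Loader as described above):

```python
import csv
import cx_Oracle  # assumes the Oracle client libraries are installed

conn = cx_Oracle.connect("etl_user", "password", "dbhost/ORCLPDB1")
cur = conn.cursor()

# Step 1: dump the flat file into a staging table (SQL*Loader would do this in bulk).
with open("source3.csv", newline="") as f:
    rows = [(r["id"], r["amount"], r["updated_on"]) for r in csv.DictReader(f)]
cur.executemany("INSERT INTO stg_source3 (id, amount, updated_on) VALUES (:1, :2, :3)", rows)

# Step 2: run the business rules inside the database (PL/SQL), then merge into the final tables.
cur.callproc("pkg_etl.apply_business_rules")   # hypothetical PL/SQL package
conn.commit()
```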
Thanks.
A book like "The Data Warehouse ETL Toolkit" by Ralph Kimball is a good resource for learning techniques/architectures to bring data from different sources into one place.