What is the best possible approach to archiving Camunda process instances?
Is there a recommended setup to move or separate the historic processes from the running processes?
Regards,
Phani
Camunda already separates runtime information (act_ru_* tables) and historic information (act_hi_* tables).
If you set your history level to "none", no historic information is written at all, which shows that the engine works without the historic tables.
So it is completely up to you: you can delete, archive or move historic information at will.
Typically, you will want to keep the historic information of running and recently finished instances for error analysis, and management will be interested in historic reports, so there is more to it indeed.
What we do:
Archive all historic information of processes that have been finished for at least a month.
Provide relevant reporting data on a daily basis, so reporting is done not on the act_hi_* tables but on a different, custom schema.
Hope that helps.
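A minimal sketch of the "archive after a month" step, assuming a heavily simplified stand-in for act_hi_procinst with only id_ and end_time_ columns. The archive_procinst table, the function name and the 30-day default are hypothetical. A real archive job would also have to move the related rows in the other act_hi_* tables and run against your actual database rather than in-memory SQLite.

```python
import sqlite3
from datetime import datetime, timedelta


def archive_finished_instances(conn, now, min_age_days=30):
    """Move history rows of instances that finished at least min_age_days ago."""
    cutoff = (now - timedelta(days=min_age_days)).isoformat()
    with conn:  # copy and delete in one transaction
        conn.execute(
            "INSERT INTO archive_procinst SELECT * FROM act_hi_procinst "
            "WHERE end_time_ IS NOT NULL AND end_time_ <= ?",
            (cutoff,),
        )
        cur = conn.execute(
            "DELETE FROM act_hi_procinst "
            "WHERE end_time_ IS NOT NULL AND end_time_ <= ?",
            (cutoff,),
        )
    return cur.rowcount


def demo():
    conn = sqlite3.connect(":memory:")
    # Simplified, hypothetical schema: only the columns this sketch needs.
    conn.execute("CREATE TABLE act_hi_procinst (id_ TEXT PRIMARY KEY, end_time_ TEXT)")
    conn.execute("CREATE TABLE archive_procinst (id_ TEXT PRIMARY KEY, end_time_ TEXT)")
    now = datetime(2024, 6, 1)
    rows = [
        ("old", (now - timedelta(days=45)).isoformat()),
        ("recent", (now - timedelta(days=3)).isoformat()),
        ("running", None),  # still running: no end time yet
    ]
    conn.executemany("INSERT INTO act_hi_procinst VALUES (?, ?)", rows)
    moved = archive_finished_instances(conn, now)
    remaining = [r[0] for r in conn.execute("SELECT id_ FROM act_hi_procinst ORDER BY id_")]
    return moved, remaining
```

Running and recently finished instances stay in the history table, so error analysis on them keeps working.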
I am trying to implement a data quality framework for an application which ingests data from various systems (batch, near real time, real time). A few items that I want to highlight here are:
The data pipelines vary widely and ingest very high volumes of data. They are developed using Spark, Python, EMR clusters, Kafka and Kinesis streams.
Any new system that we onboard into the framework should easily be able to include the data quality checks with minimal coding, so some sort of metadata framework might help, for example storing the business rules in DynamoDB, from which checks can run automatically against the different feeders and any newly created data pipeline.
Our tech stack includes AWS, Python, Spark and Java, so kindly advise on related services (AWS DataBrew, PyDeequ, the Great Expectations library and various Lambda event-driven services are some I want to focus on).
I am also looking for some sort of audit, balance and control mechanism: auditing the source data, balancing the number of records between two points, and having some automated mechanism to remediate (control) mismatches.
I am looking for testing frameworks for the different data pipelines.
Also, for data profiling, kindly advise on tools and libraries. AWS DataBrew and Pandas are some I am exploring.
I know there won't be one specific solution, so I appreciate any and all ideas. A flow diagram with audit, balance and control, plus automated data validation and a testing mechanism for data pipelines, would be very helpful.
Thanks!!!
We have a dataset of ~10 million entities of a certain Kind in Datastore. We want to change the product's functionality, so we would like to change the fields on all entities of that Kind.
Is there a smart or quick way to do it that does not involve iterating over all of the entities in series?
You can probably use Dataflow to help with your problem.
Dataflow is a stream and batch data processing service, fully managed by GCP.
Its programming model was open sourced as the Apache Beam project, and Dataflow is fully compatible with that SDK. This allows you to test your pipelines locally before running them on GCP.
It exposes two main concepts, a PCollection, basically the data that is being handled by the tool, and pipelines, the different steps necessary to capture the data, the transformations that must be performed, and how and where the results obtained should be written.
It provides support for Java, Python and Go, and a rich feature set and variety of possible data sources and transformations.
In the specific case of Datastore, Dataflow supports reading, writing and deleting data. See for instance the relevant documentation for Python.
You can see a good example of how to interact with Datastore in the Apache Beam GitHub repository.
These two other articles could also be interesting: 1 2.
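As a rough sketch of what such a pipeline could look like in Python: the field change itself (renaming "name" to "title" and adding "schema_version") is purely hypothetical, and so are the function names. The Beam imports are the v1new Datastore connector as I understand it, so please check the classes and arguments against the Beam version you actually install. Beam is imported lazily so the property-mapping function can be unit-tested without it.

```python
def migrate_properties(props):
    """Return a copy of an entity's properties in the new shape.

    Hypothetical change: rename 'name' to 'title' and add 'schema_version'.
    """
    new_props = dict(props)
    if "name" in new_props:
        new_props["title"] = new_props.pop("name")
    new_props.setdefault("schema_version", 2)
    return new_props


def run(project, kind, pipeline_args=None):
    """Read every entity of `kind`, migrate it, and write it back in parallel."""
    # Lazy imports: only needed when the pipeline is actually executed.
    import apache_beam as beam
    from apache_beam.io.gcp.datastore.v1new.datastoreio import (
        ReadFromDatastore,
        WriteToDatastore,
    )
    from apache_beam.io.gcp.datastore.v1new.types import Entity, Query
    from apache_beam.options.pipeline_options import PipelineOptions

    def migrate_entity(entity):
        # Build a new entity instead of mutating the pipeline's input element.
        migrated = Entity(entity.key, exclude_from_indexes=entity.exclude_from_indexes)
        migrated.set_properties(migrate_properties(entity.properties))
        return migrated

    with beam.Pipeline(options=PipelineOptions(pipeline_args)) as p:
        (
            p
            | "Read" >> ReadFromDatastore(Query(kind=kind, project=project))
            | "Migrate" >> beam.Map(migrate_entity)
            | "Write" >> WriteToDatastore(project)
        )
```

Every entity is still touched once, but the runner splits the query and spreads the work across workers, so it does not happen in series.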
I would presume that you have to loop through each one and update it, as it's a NoSQL data store like Mongo from what I can see. We have a system that uses SQL and Mongo, and the denormalised data is a pain: we had to write migrations that would loop through all the documents and update them.
As part of our Governance initiative and regulatory requirements, we need to produce a Lineage (traceability) report, outlining the flow of data into our Warehouse, and the Reports or Services consuming its data. We are aware that Information Governance Catalog can produce such a report automatically when DataStage is writing data to the Warehouse. Can Information Governance Catalog do the same when we use SQL Scripts or other tooling to read or write information to our Warehouse? Can I view a complete Lineage report that incorporates such different information?
What are the steps within IGC to document or otherwise define the usage of information to support Data Lineage and Regulatory reporting?
Yes. While we can automate the production of Lineage (traceability) reports for DataStage, IGC also offers a facility to document the flow of data for other data movement scripts, tools or processes. This will produce the same Lineage reports, which can be used to satisfy compliance needs or to build confidence and trust in the use or consumption of data.
At its simplest, IGC allows one to draft a Mapping Document: essentially a spreadsheet that delineates the Data Source and Data Target, with documentation to support the transformation, aggregation or other logic. The spreadsheet can be authored directly in IGC, or loaded from Excel (text file), which further supports automation of the process. Documentation for Extension Mapping Documents can be found here: https://www.ibm.com/support/knowledgecenter/en/SSZJPZ_11.5.0/com.ibm.swg.im.iis.mdwb.doc/topics/c_extensionMappings.html (though I suggest creating such a document from IGC and exporting the results to Excel).
In addition, IGC supports a more formal process for extending the Catalog and introducing new types of Assets. This goes one step further: it properly documents and catalogs the Data Processes (SQL commands, other ETL tooling) and maps the data movement through those Processes. This allows users to identify with the Data Process and even to include operational data (as is supported for IGC). More information on this process can be found here: https://www-01.ibm.com/support/docview.wss?uid=swg21699130
I suggest reviewing the absolute requirements and what information is needed for the ensuing traceability report. Starting with the Extension Mapping Document should suffice; it would be the simplest to implement and would drive immediate benefit.
I am redesigning a small monolith ETL software written in Python. I find a microservice architecture suitable as it will give us the flexibility to use different technologies if needed (Python is not the nicest language for enterprise software in my opinion). So if we had three microservices (call them Extract, Transform, Load), we could use Java for Transform microservice in the future.
The problem is, it is not feasible here to pass the result of a service call in an API response (say HTTP). The output from Extract is going to be gigabytes of data.
One idea is to call Extract and have it store the results in a database (which is really what that module is doing in the monolith, so easy to implement). In this case, the service will return only a yes/no response (was the process successful or not).
I was wondering if there were a better way to approach this. What would be a better architecture? Is what I'm proposing reasonable?
If your ETL process works on individual records (some parallelizable units of computation), then there are a lot of options you could go with. Here are a few:
Messaging System-based
You could base your processing around a messaging system, like Apache Kafka. It requires a careful setup and configuration (depending on durability, availability and scalability requirements of your specific use-cases), but may give you a better fit than a relational db.
In this case, the ETL steps would work completely independently, and just consume some topics, produce into some other topics. Those other topics are then picked up by the next step, etc. There would be no direct communication (calls) between the E/T/L steps.
It's a clean and easy to understand solution, with independent components.
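To make the shape of this concrete, here is a small stdlib-only sketch. It is not Kafka: in-memory queues stand in for topics and threads stand in for the independent services, and the SENTINEL end-of-stream marker, the function names and the upper-casing transform are all made up for illustration. The point is only that each step reads from one "topic" and writes to another, with no direct calls between steps.

```python
import queue
import threading

SENTINEL = None  # end-of-stream marker for this sketch; real topics are unbounded


def extract(out_topic, records):
    """Publish source records to the 'raw' topic."""
    for record in records:
        out_topic.put(record)
    out_topic.put(SENTINEL)


def transform(in_topic, out_topic):
    """Consume 'raw', publish transformed records to 'clean'."""
    while (record := in_topic.get()) is not SENTINEL:
        out_topic.put({**record, "name": record["name"].upper()})
    out_topic.put(SENTINEL)


def load(in_topic, sink):
    """Consume 'clean' and write to the final store (a list here)."""
    while (record := in_topic.get()) is not SENTINEL:
        sink.append(record)


def run_pipeline(records):
    # Bounded queues give simple backpressure between the steps.
    raw, clean, sink = queue.Queue(maxsize=100), queue.Queue(maxsize=100), []
    stages = [
        threading.Thread(target=extract, args=(raw, records)),
        threading.Thread(target=transform, args=(raw, clean)),
        threading.Thread(target=load, args=(clean, sink)),
    ]
    for stage in stages:
        stage.start()
    for stage in stages:
        stage.join()
    return sink
```

With a real broker, each function would become its own service with a consumer and a producer, and any step could be rewritten in another language without the others noticing.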
Off-the-shelf processing solutions
There are a couple of OTS solutions for data processing/computation and transformation: Apache Flink, Apache Storm, Apache Spark.
Although these solutions would obviously confine you to one particular technology, they may be better than building a similar system from scratch.
Non-persistent
If the actual data is streaming/record-based, and it is not required to persist the results between steps, you could just get away with long-polling the HTTP output of the previous step.
You say it is just too much data, but that data doesn't have to go to the database (if it's not required), and could just go to the next step instead. If the data is produced continuously (not everything in one batch), on the same local network, I don't think this would be a problem.
This would be technically very easy to do, very simple to validate and monitor.
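A minimal sketch of the record-by-record idea, with the HTTP transport left out. It assumes Extract writes newline-delimited JSON to its response body and Transform iterates over that body line by line. The field names and the cents conversion are hypothetical. Because everything is a generator, only one record is held in memory at a time.

```python
import json


def extract_stream(rows):
    """Yield one JSON line per record, as Extract would write to its response body."""
    for row in rows:
        yield json.dumps(row) + "\n"


def transform_stream(lines):
    """Consume lines lazily (e.g. from an iterated HTTP response) and yield records."""
    for line in lines:
        record = json.loads(line)
        # Hypothetical transformation: store the amount as integer cents.
        record["amount_cents"] = round(record.pop("amount") * 100)
        yield record


def load(records, sink):
    """Write each record to the target as it arrives and return how many were loaded."""
    count = 0
    for record in records:
        sink.append(record)
        count += 1
    return count
```

Usage would look like `load(transform_stream(extract_stream(rows)), sink)`, where in the real system the argument to transform_stream is the line iterator of your HTTP client's streamed response.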
I would suggest you have a look at Apache Flink. It is very similar to the mappings in big enterprise tools like Informatica, Talend and DataStage, but it processes data in smaller chunks, repeatedly. It helps you compute and transform records on the fly as they arrive and then store or load them into a file or database.
Our current Flink infrastructure processes close to 28.5 GB every 4 hours and it just works. In the early days, we ran both our daily batch and the Flink stream to make sure they produced consistent results. Eventually most of the streams were left active and the daily batches were gradually retired.
Hope it helps someone.
There's nothing preventing you from having an SFTP server containing CSV files, or a database, store the results. You can do whatever makes sense. Using messaging to pass gigabytes of data, or streaming through HTTP, may or may not make sense for your case.
This is an interesting problem. The best solution for this could be Reactive Spring Boot. You can build your Extract service as a Reactive Spring Boot app and, instead of sending gigabytes of data in one response, stream the data to the service that needs it.
You might wonder whether streaming will tie up the worker thread. The answer is no. It relies on non-blocking I/O at the OS level, so it doesn't hold up a request thread while streaming the results. That's the beauty of Reactive Spring Boot.
Go through this and explore
https://spring.io/blog/2016/07/28/reactive-programming-with-spring-5-0-m1
I am building an application (using Django's ORM) that will ingest a lot of events, let's say 50/s (1-2k per msg). Initially some "real time" processing and monitoring of the events is in scope, so I'll be using Redis to keep some of that data to make decisions, expunging it when it makes sense. I was going to persist all of the entities, including events, in Postgres for "at rest" storage for now.
In the future I will need "analytical" capability for dashboards and other features. I want to use Amazon Redshift for this. I considered just going straight for Redshift and skipping Postgres. But I also see folks say that it should play more of a passive role. Maybe I could keep a window of data in the SQL backend and archive to Redshift regularly.
My question is:
Is it even normal to use something like Redshift as a backend for web applications or does it typically play more of a passive role? If not is it realistic to think I can scale the Postgres enough for the event data to start with only that? Also if not, does the "window of data and archival" method make sense?
EDIT Here are some things I've seen before writing the post:
Some say "yes, go for it" regarding the "should I use Redshift for this" question.
Some say "eh, not performant enough for most web apps" and support the "front it with a Postgres database" camp.
Redshift (ParAccel) is an OLAP-optimised DB, based on a fork of a very old version of PostgreSQL.
It's good at parallelised read-mostly queries across lots of data. It's bad at many small transactions, especially many small write transactions as seen in typical OLTP workloads.
You're partway in between. If you don't mind a data loss window, then you could reasonably accumulate data points and have a writer thread or two write batches of them to Redshift in decent sized transactions.
If you can't afford any data loss window and expect to be processing 50+ TPS, then don't consider using Redshift directly. The round-trip costs alone would be horrifying. Use a local database - or even a file based append-only journal that you periodically rotate. Then periodically upload new data to Redshift for analysis.
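For what it's worth, a file-based append-only journal with rotation can be quite small. This is only a sketch under my own assumptions: JSON-lines files, rotation by record count, and an fsync per event for durability. The class name and file naming are made up. The upload of closed segments to Redshift (for example via S3 and a bulk load) is not shown.

```python
import json
import os
import tempfile


class Journal:
    """Append-only JSON-lines journal that rotates after max_records appends."""

    def __init__(self, directory, max_records=1000):
        self.directory = directory
        self.max_records = max_records
        self.segment = 0
        self.count = 0
        self.closed_segments = []  # full files, ready for a periodic bulk upload
        self._open()

    def _path(self):
        return os.path.join(self.directory, f"events-{self.segment:06d}.jsonl")

    def _open(self):
        self.file = open(self._path(), "a", encoding="utf-8")

    def append(self, event):
        self.file.write(json.dumps(event) + "\n")
        self.file.flush()
        os.fsync(self.file.fileno())  # durable before acknowledging; costs throughput
        self.count += 1
        if self.count >= self.max_records:
            self.rotate()

    def rotate(self):
        self.file.close()
        self.closed_segments.append(self._path())
        self.segment += 1
        self.count = 0
        self._open()

    def close(self):
        self.file.close()


def demo():
    with tempfile.TemporaryDirectory() as directory:
        journal = Journal(directory, max_records=2)
        for i in range(5):
            journal.append({"seq": i})
        journal.close()
        sizes = []
        for path in journal.closed_segments:
            with open(path, encoding="utf-8") as f:
                sizes.append(len(f.readlines()))
        return sizes
```

Whether an fsync per event is affordable at 50+ events per second depends on your disks, so benchmark it before relying on it.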
A few other good reasons you probably shouldn't use Redshift directly:
OLAP DBs with column store designs often work best with star schemas or similar structures. Such schemas are slow and inefficient for OLTP workloads as inserts and updates touch many tables, but they make querying the data along various axes for analysis much more efficient.
Using an ORM to talk to an OLAP DB is asking for trouble. ORMs are quite bad enough on OLTP-optimised DBs, with their unfortunate tendency toward n+1 SELECTs and/or wasteful chained left joins, tendency to do many small inserts instead of a few big ones, etc. This will be even worse on most OLAP-optimised DBs.
Redshift is based on a painfully old PostgreSQL with a bunch of limitations and incompatibilities. Code written for normal PostgreSQL may not work with it.
Personally I'd avoid an ORM entirely for this - I'd just accumulate data locally in an SQLite or a local PostgreSQL or something, sending multi-valued INSERTs or using PostgreSQL's COPY to load chunks of data as I received it from an in-memory buffer. Then I'd use appropriate ETL tools to periodically transform the data from the local DB and merge it with what was already on the analytics server.
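Roughly what I mean by accumulating locally and inserting in chunks, sketched with SQLite from the standard library. The events table, its columns and the batch size are invented for the example. Note that anything still sitting in the in-memory buffer is lost if the process dies, which is exactly the data loss window mentioned above.

```python
import sqlite3


class EventBuffer:
    """Collect events in memory and insert them in one transaction per batch."""

    def __init__(self, conn, batch_size=500):
        self.conn = conn
        self.batch_size = batch_size
        self.pending = []
        conn.execute(
            "CREATE TABLE IF NOT EXISTS events (ts REAL, kind TEXT, payload TEXT)"
        )

    def add(self, ts, kind, payload):
        self.pending.append((ts, kind, payload))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        """Write all pending events and return how many were written."""
        if not self.pending:
            return 0
        with self.conn:  # a single transaction for the whole batch
            self.conn.executemany("INSERT INTO events VALUES (?, ?, ?)", self.pending)
        written = len(self.pending)
        self.pending.clear()
        return written


def demo():
    conn = sqlite3.connect(":memory:")
    buffer = EventBuffer(conn, batch_size=3)
    for i in range(7):
        buffer.add(float(i), "click", f"payload-{i}")
    stored_before = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
    flushed = buffer.flush()
    stored_after = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
    return stored_before, flushed, stored_after
```

A periodic ETL job can then read from this local table and load it into the analytics database in bulk.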
Now forget everything I just said and go do some benchmarks with a simulation of your app's workload. That's the only really useful way to tell.
In addition to Redshift's slow transaction processing (by modern DB standards) there's another big challenge:
Redshift only supports serializable transaction isolation, most likely as a compromise to support ACID transactions while also optimizing for parallel, mostly-read OLAP workloads.
That can result in all kinds of concurrency-related failures that would not have been failures on a typical DB that uses read-committed isolation by default.
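In practice this means that application code writing to such a database needs to be prepared to retry whole transactions. A generic sketch of that, with loud assumptions: SerializationFailure is a stand-in exception I made up for whatever error your driver raises on an isolation violation, and txn is any callable that runs one complete transaction.

```python
import random
import time


class SerializationFailure(Exception):
    """Stand-in for the driver error raised on a serializable isolation violation."""


def run_with_retry(txn, max_attempts=5, base_delay=0.05, sleep=time.sleep):
    """Run txn(), retrying the whole transaction when it fails to serialize."""
    for attempt in range(1, max_attempts + 1):
        try:
            return txn()
        except SerializationFailure:
            if attempt == max_attempts:
                raise  # give up and surface the error to the caller
            # Exponential backoff with jitter so competing writers spread out.
            sleep(base_delay * (2 ** (attempt - 1)) * random.uniform(0.5, 1.5))
```

The transaction body has to be safe to run more than once, which is one more reason to keep write transactions against such a database few and large.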