Informatica CDC Mapping: Group Source is fetching records Slowly - informatica

We have 37 Informatica sessions, most of which have around 25 tables on average. A few sessions have a single table as source and target. Our source is Oracle and our target is a Greenplum database. We are using PowerExchange 10.1, installed on the Oracle side, to fetch our changed records.
We have noticed that sessions with more tables take longer to fetch the data and update the target. Does adding more tables delay processing? If so, how can we tune the sessions to fetch the records as fast as possible?

We run 19 CDC mappings with between 17 and 90 tables in each, and have recently had a breakthrough in performance. The number of tables is not the most significant limiting factor for us; PowerCenter and PowerExchange are. Our source is DB2 on z/OS, but that is probably not important ...
This is what we did:
1) We increased the DTM buffer block size to 256KB and the DTM buffer size to 1GB or more; a 'complex' mapping needs many buffer blocks.
2) We changed the connection attributes to:
- Realtime flush latency=86000 (max setting)
- Commit size in the session set extremely high (to let the above setting be the deciding factor)
- UOW count=-1 (same reason as above)
- Maximum rows per commit=0
- Minimum rows per commit=0
3) We set the session property 'recovery strategy' to 'fail task and continue workflow' and implemented our own solution that creates a 'restart token file' from scratch every time the workflow starts.
Only slightly off topic: we implemented this with an extra table (we call it a SYNC table) containing a single row. That row is updated every 10 minutes on the source by a very reliable scheduled process (a small CICS program). The content of this table is written to the target database once per workflow, and the mapping adds an extra column containing the value of $$PMWorkflowName. Apart from the workflow name column, the two DTL__Restart1 and *2 columns are written to the target as well.
During startup of the workflow, before the actual CDC session, we run a small reusable session that reads the record for the current workflow from the SYNC table on the target side and creates the RESTART file from scratch.
[Please note that you will end up with duplicates covering up to 10 minutes (counted from the workflow start time) in the target. We accept that and aggregate them away in all mappings that read from these tables.]
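To illustrate what "aggregating the duplicates away" can look like downstream, here is a minimal Python sketch. It is not the mapping logic described above. It assumes each change row carries a business key plus some change identifier that is identical when the same change is delivered twice after a restart. The names `ChangeRow`, `table_key`, `change_id` and `drop_replayed_rows` are hypothetical.

```python
from typing import Iterable, List, NamedTuple


class ChangeRow(NamedTuple):
    table_key: str  # business key of the changed row (hypothetical name)
    change_id: str  # identifier that repeats when a change is replayed (assumption)
    payload: str    # the changed column values, simplified to one string


def drop_replayed_rows(rows: Iterable[ChangeRow]) -> List[ChangeRow]:
    """Keep the first occurrence of each (table_key, change_id) pair, in order."""
    seen = set()
    unique = []
    for row in rows:
        ident = (row.table_key, row.change_id)
        if ident in seen:
            continue  # same change delivered again after a restart
        seen.add(ident)
        unique.append(row)
    return unique


if __name__ == "__main__":
    replayed = [
        ChangeRow("k1", "c1", "a"),
        ChangeRow("k1", "c1", "a"),  # replay of the first change
        ChangeRow("k1", "c2", "b"),
    ]
    for row in drop_replayed_rows(replayed):
        print(row)
```

In a warehouse the same idea would more likely be expressed as a GROUP BY or DISTINCT in the reading mapping; the sketch only shows the rule being applied.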
Try tinkering with combinations of these and tell us what you experience. We now have a maximum throughput in a 10 minute interval of 10-100 million rows per mapping. Our target is Netezza (aka PDA from IBM).
One more thing I can tell you:
Every time a commit is triggered (every 86 seconds with the above settings), PowerCenter will empty all its writer buffers against all of the tables in one big commit scope. If any of these tables is locked by another process, you may end up with a lot of cascaded locking on the writer side, which will make the CDC seem slow.

Related

Joining Solution using Co-Group by SideInput Apache Beam

I have two tables to join with a left join. Below are the two situations showing how my pipeline behaves.
The job runs in batch mode, it is all user data, and we want to process it in Google Dataflow.
Day 1:
Table A: 5000000 records (size 3TB)
Table B: 200 records (size 1GB)
The tables were joined through a side input, with Table B's data taken as the side input, and it worked fine.
Day 2:
Table A: 5000010 records (size 3.001TB)
Table B: 20000 records (size 100GB)
On the second day my pipeline slows down because the side input uses a cache, and the cache was exhausted because Table B had grown.
So I tried using CoGroupByKey, but the Day 1 data processing was pretty slow, with a log message about having 10000+ values on a single key.
Is there a better-performing way to do the join when a hot key is introduced?
It is true that the performance can drop precipitously once table B no longer fits into cache, and there aren't many good solutions. The slowdown in using CoGroupByKey is not solely due to having many values on a single key, but also the fact that you're now shuffling (aka grouping) Table A at all (which was avoided when using a side input).
Depending on the distribution of your keys, one possible mitigation is to route your hot keys down a path that does the side-input join as before, and your long-tail keys into a CoGroupByKey. This could be done by producing a truncated TableB' as a side input; your ParDo would look up each key, emitting to one PCollection if the key was found in TableB' and to another if it was not [1]. One would then pass this second PCollection to a CoGroupByKey with all of TableB, and flatten the results.
[1] https://beam.apache.org/documentation/programming-guide/#additional-outputs
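The routing described in the answer can be sketched in plain Python, without the Beam API, to show which records take which path. This is a minimal sketch under assumptions that are not in the original: `hot_keys` is a set chosen in advance, Table B has at most one row per key, and the function name `hybrid_left_join` is hypothetical. In a real pipeline the dictionary would correspond to the TableB' side input, the two lists to the tagged outputs of the ParDo, and the grouping loop to CoGroupByKey followed by Flatten.

```python
from collections import defaultdict
from typing import Iterable, List, Optional, Set, Tuple

Row = Tuple[str, str]                     # (key, value)
Joined = Tuple[str, str, Optional[str]]   # (key, a_value, b_value or None)


def hybrid_left_join(table_a: Iterable[Row],
                     table_b: List[Row],
                     hot_keys: Set[str]) -> List[Joined]:
    # Truncated TableB': only the rows for hot keys, small enough to cache.
    b_hot = {key: b_val for key, b_val in table_b if key in hot_keys}

    joined_hot: List[Joined] = []
    tail: List[Row] = []
    for key, a_val in table_a:
        if key in b_hot:
            # Hot path: in-memory lookup, no grouping of Table A needed.
            joined_hot.append((key, a_val, b_hot[key]))
        else:
            # Long-tail path: deferred to the grouped join below.
            tail.append((key, a_val))

    # Stand-in for CoGroupByKey over the tail of A and all of Table B.
    grouped = defaultdict(lambda: {"a": [], "b": []})
    for key, a_val in tail:
        grouped[key]["a"].append(a_val)
    for key, b_val in table_b:
        grouped[key]["b"].append(b_val)

    joined_tail: List[Joined] = []
    for key, sides in grouped.items():
        b_vals = sides["b"] or [None]  # left join: keep A rows without a match
        for a_val in sides["a"]:
            for b_val in b_vals:
                joined_tail.append((key, a_val, b_val))

    # Stand-in for Flatten: merge both paths into one result.
    return joined_hot + joined_tail


if __name__ == "__main__":
    a = [("u1", "a1"), ("u1", "a2"), ("u2", "a3"), ("u3", "a4")]
    b = [("u1", "b1"), ("u2", "b2")]
    for row in hybrid_left_join(a, b, hot_keys={"u1"}):
        print(row)
```

How the hot keys are identified (for example, from a sample of Table A) is left open here, as it is in the answer.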

Is there a system DMV to monitor the files being loaded?

I'm loading files into Azure DW from blob store using Polybase.
I usually use sys.dm_pdw_exec_requests and sys.dm_pdw_sql_requests to see what any long-running processes are doing, but Polybase loads expose limited information.
Is there a view that can show the list of files Polybase has found in the directory and indicate any kind of progress (maybe completed files or rows loaded)?
We're still adding to the functionality around Polybase monitoring.
Here is a query that will help you to monitor the progress of the current files being loaded. "Current" means that if there are 1,000 files in a data set, and Polybase is processing them 10 at a time, only 10 rows should result from this query at any given time.
-- To track bytes and files
SELECT
    r.command,
    s.request_id,
    r.status,
    count(distinct s.input_name) as nbr_files,
    sum(s.bytes_processed)/1024/1024/1024 as gb_processed
FROM
    sys.dm_pdw_exec_requests r
    inner join sys.dm_pdw_dms_external_work s
        on r.request_id = s.request_id
GROUP BY
    r.command,
    s.request_id,
    r.status
ORDER BY
    nbr_files desc,
    gb_processed desc;
This is an increasingly important topic, and I've created a User Voice task to register user support. Would you mind adding your votes/comments?

Informatica PowerExchange CDC Data results in target DB way too slow

First of all, I'm very new to Informatica PowerCenter and PowerExchange.
We are using Informatica PowerCenter and PowerExchange to deliver CDC data from our source DB2 to a PostgreSQL DB. We have one workflow in which 7 tables are mapped, and the results arrive in our PostgreSQL. It works fine so far, but performance is lacking. The problem is not the volume of data but the delay before results appear in the target DB.
When I insert or delete some data on the DB2 (say, 10 rows in one db), I usually see the results in our PostgreSQL after about 10-30 seconds (very rarely in less than 5 seconds).
My goal is to reduce this delay. Is this possible? What would I need for that?
I played a little with the commit interval and the DTM buffer size, but nothing helped much.
I also have the feeling that when I configure the workflow to run continuously, it is even slower than when I start the workflow after making the inserts/deletes.
Thanks in advance

Amount of Test Data needed for load testing of a web service

I am currently working on a project that requires load testing of web services.
One of the services is called 60,000 times in production during the busy day/busy hour.
{PerfTest Env=PROD}
Input Account Number
Output AccountDetails
Do I really need 60,000 unique account numbers (test data) for this LoadRunner script to simulate the production scenario?
If unique data is required, I will have to prepare a lot of test data for each web service for the endurance test.
If I can't get that much test data, how likely is it that the load test will be skewed by the application server's caching mechanism?
Can somebody help me?
Thanks
Ram
Are you simulating a day or the highest volume hour in the last year? This can help you to shape the amount of data that you need. Rarely would you start with a 24 hour test. Instead you would be looking at your high water test of an hour with a ramp up and ramp down, so you would need approximately 1.333* your high water hour's worth of data.
So this can drop your 60K to (potentially) 20K(?) I am making an assumption that your worst hour over the last year is somewhere around 1/3 of your traditional day. I have observed this pattern over and over again in different environments over the past two decades. You will want to objectively verify this with log data or query data to support the number in your environment.
Next up, how many of these inquiries are actually unique? You are really going to need a log of the queries across a day (or your high water hour) to determine this. Log processing tools such as Microsoft Logparser or Splunk/Splunk Storm can help you to pull the observed distribution of unique account references within your data, including counts of those which are multiple. Once you know this you can simply use a data file with a fixed block size for each user for unique data and once the data is exhausted the user exits.
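As a rough illustration of pulling that distribution out of a log, here is a minimal Python sketch. It assumes the account numbers have already been extracted from the log into a list, one entry per call. The function name `account_distribution` and the sample values are hypothetical, and a tool such as Logparser or Splunk would do the same counting at scale.

```python
from collections import Counter
from typing import Dict, Iterable


def account_distribution(account_ids: Iterable[str]) -> Dict[str, int]:
    """Summarise how many calls hit unique versus repeated account numbers."""
    counts = Counter(account_ids)
    return {
        "total_calls": sum(counts.values()),
        "unique_accounts": len(counts),
        "accounts_called_more_than_once": sum(1 for c in counts.values() if c > 1),
    }


if __name__ == "__main__":
    # Hypothetical account numbers taken from one hour of request logs.
    calls = ["1001", "1002", "1001", "1003", "1001", "1002"]
    print(account_distribution(calls))
```

The number of unique accounts, rather than the raw call count, is then the starting point for sizing the data file.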

Incremental update of millions of records, indexed vs. join

I'm currently developing a strategy for an incremental update of our user data. We assume 100_000_000 records in our database of which approximately 1_000_000 records are updated per workflow.
The idea is to update records in a MapReduce job. Is it useful to use an indexed storage (eg. Cassandra) to be able to access current records randomly? Or is it preferable to retrieve data from HDFS and join new information to existing records.
The record size is O(200 Bytes). The user data has a fixed length but should be extendable. The log events have a similar but not equal structure. The number of user records is likely to grow. Near real-time updates are desirable, ie. a 3 hour time gap is not acceptable, few minutes is OK.
Do you have any experience with either of these strategies and data of this size?
Is the pig JOIN fast enough? Is it a bottleneck always to read all records? Is Cassandra able to hold this amount of data efficiently? Which solution is scalable? What about the complexity of the system?
You need to define your requirements first. Your record volumes are not a problem, but you don't give a record length. Are they fixed length, fixed field number, likely to change format over time? Are we talking 100 byte records or 100,000 byte records? You need an index on a field/column if you wish to query by that field/column, unless you do all your work using map/reduce. Will the number of user records stay at 100 million (1 server will probably suffice) or will it grow 100% per year (probably multiple servers, with new ones added over time)?
How you access records for updating depends on whether you need to update them in real-time or whether you can run a batch job. Will updates be every minute, or hour, or month?
I would strongly suggest you do some experimenting. Have you done any testing already? This will give you a context for your questions and this will lead to more objective questions and answers. It is unlikely that you can 'whiteboard' a solution based on your question.