Re: Auditing Tables in Informatica

I have a stored procedure that initiates the process by generating a BATCH_ID. In Informatica, I don't want to call that stored procedure just to initialize the BATCH_ID. Is there a best practice for initializing the batch_id and inserting the values into an audit table?

You can have a pre-process insert a row into the audit table.
If you provide more details about your use case, we can add to this.
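As a rough illustration of that pre-process, the sketch below is a script you could call from a pre-session command; the audit table (audit_batch), its columns and the DSN are hypothetical placeholders, not Informatica specifics.

    import pyodbc
    from datetime import datetime

    # Connect to wherever the audit table lives (DSN is a placeholder).
    conn = pyodbc.connect("DSN=audit_db")
    cur = conn.cursor()

    # Generate the next BATCH_ID from the audit table itself (a database
    # sequence would work just as well) and insert the "batch started" row.
    cur.execute("SELECT COALESCE(MAX(batch_id), 0) + 1 FROM audit_batch")
    batch_id = cur.fetchone()[0]
    cur.execute(
        "INSERT INTO audit_batch (batch_id, status, start_time) VALUES (?, ?, ?)",
        (batch_id, "STARTED", datetime.now()),
    )
    conn.commit()
    print(batch_id)  # can be captured into a workflow variable or parameter file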

Related

How to store data (table names) extracted from PG_TABLE_DEF into a table/temp table

I will give you the context behind this request. It is related to the post "Redshift cursor doesn't exist after creating a stored procedure". I have a workaround for that using a FOR loop with a ROW_NUMBER window function. To do that, I need to get the list of table names from PG_TABLE_DEF and store it in a temp table for processing through a LOOP within the stored procedure. The challenge is that certain operations cannot be run against tables like PG_TABLE_DEF, which run only on the LEADER node. Hence I got the error below when I tried to copy data from PG_TABLE_DEF into a new temp table through CTAS.
ERROR: Specified types or functions (one per INFO message) not supported on Redshift tables.
Could someone please help me overcome this scenario?
As you state, pg_table_def only exists on the leader node, and on Redshift there is no way for a compute node to access this information during a query. So if you need this information on the compute nodes, you need to first query it from the leader and then (somehow) route it back to the compute nodes. This can be done in several ways, but all require that you fully execute the query on the leader node first.
You can do this with a Lambda function or other externally executed code that reads pg_table_def and then inserts (copies) the data into a normal table. Or you can execute the leader-node query into a cursor and then read the cursor with a stored procedure, depositing the data into a normal table. These two paths do basically the same thing: read the catalog table on the leader node and then put the result of that query into a normal table. I know of no other way to do this.
Here's an answer with code for doing this that I wrote up 2 years ago: How to join System tables or Information Schema tables with User defined tables in Redshift
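As a rough illustration of the externally executed path (this is not the code from the linked answer), a minimal sketch assuming a psycopg2 connection and placeholder cluster/table names:

    import psycopg2

    # Connection details are placeholders.
    conn = psycopg2.connect(host="my-cluster.example.com", port=5439,
                            dbname="dev", user="awsuser", password="...")
    cur = conn.cursor()

    # The SELECT against pg_table_def runs on the leader node and returns the
    # result set to this client, which is allowed.
    cur.execute("SELECT DISTINCT schemaname, tablename FROM pg_table_def")
    rows = cur.fetchall()

    # Land the result in a normal table that compute nodes (and your stored
    # procedure's LOOP) can work with. Row-by-row inserts are fine here
    # because the catalog listing is small.
    cur.execute("CREATE TABLE IF NOT EXISTS table_list "
                "(schemaname varchar(128), tablename varchar(128))")
    cur.execute("DELETE FROM table_list")
    cur.executemany("INSERT INTO table_list (schemaname, tablename) VALUES (%s, %s)",
                    rows)
    conn.commit()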

Best practice for using a DynamoDB table when it needs to be periodically updated

In my use case, I need to periodically update a DynamoDB table (roughly once per day). Since lots of entries need to be inserted, deleted or modified, I plan to drop the old table and create a new one each time.
How can I keep the table queryable while I recreate it? Which API should I use? It's fine for queries to keep hitting the old table during the rebuild, so that customers won't experience any outage.
Is it possible to have something like a version number for the table so that I could roll back quickly?
I would suggest table names with a common suffix (some people use a date, others use a version number).
Store the usable DynamoDB table name in a configuration store (if you are not already using one, you could use Secrets Manager, SSM Parameter Store, another DynamoDB table, a Redis cluster, or a third-party solution such as Consul).
Automate the creation and loading of the new DynamoDB table, then update the config store with the name of the newly created table. Allow enough time to switch over, then remove the previous DynamoDB table.
You could handle the final part with Step Functions to automate the workflow, with a Wait of a few hours to ensure that nothing is still happening; in fact, you could even add a Lambda function that validates whether any traffic is still hitting the old DynamoDB table.
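A minimal sketch of that pattern, assuming SSM Parameter Store as the config store; the parameter name and table names are placeholders:

    import boto3

    ssm = boto3.client("ssm")
    dynamodb = boto3.resource("dynamodb")

    CONFIG_KEY = "/myapp/live-table-name"  # assumed parameter name

    def current_table():
        """Readers resolve the live table name from the config store at query time."""
        name = ssm.get_parameter(Name=CONFIG_KEY)["Parameter"]["Value"]
        return dynamodb.Table(name)

    def publish_new_version(new_table_name):
        """After the new table is fully loaded, point readers at it."""
        ssm.put_parameter(Name=CONFIG_KEY, Value=new_table_name,
                          Type="String", Overwrite=True)
        # Keep the previous table until you are confident nothing still reads it
        # (the Step Functions Wait + validation idea above), then delete it.

Rolling back is then just writing the previous table name back into the parameter, which also covers the version-number question: the old table remains the rollback target until you delete it.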

BigQuery: save clusters of a clustered table to Cloud Storage

I have a BigQuery table that's clustered by several columns; let's call them client_id and attribute_id.
What I'd like is to submit one job or command that exports the table data to Cloud Storage, but saves each cluster (so each combination of client_id and attribute_id) to its own object. So the final URIs might look something like this:
gs://my_bucket/{client_id}/{attribute_id}/object.avro
I know I could pull this off by iterating over all the possible combinations of client_id and attribute_id, using a client library to query the relevant data into a BigQuery temp table, and then exporting that data to a correctly named object, and I could do so asynchronously.
But.... I imagine all the clustered data is already stored in a format somewhat like what I'm describing, and I'd love to avoid the unnecessary cost and headache of writing the script to do it myself.
Is there a way to accomplish this already without requesting a new feature to be added?
Thanks!
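For reference, a rough sketch of the iterate-and-export approach described above, using the google-cloud-bigquery client and the EXPORT DATA statement; the project, dataset, table and bucket names are placeholders, and the ids are assumed to be strings:

    from google.cloud import bigquery

    client = bigquery.Client()
    TABLE = "my_project.my_dataset.my_clustered_table"  # assumed table
    BUCKET = "my_bucket"                                # assumed bucket

    # 1) Find every (client_id, attribute_id) combination.
    combos = client.query(
        f"SELECT DISTINCT client_id, attribute_id FROM `{TABLE}`"
    ).result()

    # 2) Export each combination to its own prefix. The export jobs run
    #    server-side, so they can be submitted without waiting on each other.
    jobs = []
    for row in combos:
        uri = f"gs://{BUCKET}/{row.client_id}/{row.attribute_id}/*.avro"
        sql = f"""
            EXPORT DATA OPTIONS (uri = '{uri}', format = 'AVRO', overwrite = true) AS
            SELECT * FROM `{TABLE}`
            WHERE client_id = '{row.client_id}' AND attribute_id = '{row.attribute_id}'
        """
        jobs.append(client.query(sql))

    for job in jobs:
        job.result()  # wait for all exports to finish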

How to monitor the number of records loaded into a BQ table while using BigQuery streaming?

We are trying to insert data into BigQuery (streaming) using Dataflow. Is there a way we can keep a check on the number of records inserted into BigQuery? We need this data for reconciliation purposes.
Add a step to your Dataflow pipeline that calls the Google API Tables.get, or run this query before and after the flow (both approaches work equally well).
select row_count, table_id from `dataset.__TABLES__` where table_id = 'audit'
The query returns the current row_count for the table at that point in time, so comparing the values from before and after the flow gives you the number of records loaded.
You may also be able to examine the "Elements added" counter by clicking on the step that writes to BigQuery in the Dataflow UI.
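For the Tables.get route, a minimal sketch with the google-cloud-bigquery client (the table name is a placeholder); note that rows still sitting in the streaming buffer may not be reflected in the count immediately:

    from google.cloud import bigquery

    client = bigquery.Client()
    TABLE = "my_project.my_dataset.audit"  # assumed table

    before = client.get_table(TABLE).num_rows
    # ... run the Dataflow job here ...
    after = client.get_table(TABLE).num_rows

    print(f"records loaded: {after - before}")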

Auditing Tables in Informatica

We want to maintain auditing of tables. My questions are:
1) Is the commit interval in Informatica stored anywhere in a variable, so that we can maintain the record count for every commit interval?
2) Is there any method/script to read the stats from the session log and save them in an audit table?
3) If there are multiple targets in my mapping, then after execution the Monitor shows the target success count and target reject count as totals across all the targets in the mapping. How do we get the individual success and reject counts per target?
You need to use the Informatica metadata tables, which Informatica doesn't recommend (I am still mentioning them for your reference). So your options are to create a sh/bat script that pulls this information from the session log, or to create a mapplet that collects this kind of statistics and add that mapplet to every Informatica mapping. To answer your questions:
1) Yes, the commit interval is stored in the Informatica table opb_task_attr; filter on attr_id = 14 and select attr_value.
2) No; you can either use an Informatica mapplet to collect such stats or a shell script.
3) Yes, this is possible. Use the Informatica view rep_sess_tbl_log for this purpose. There you can get each target's statistics for a particular session run.
Koushik
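For reference, a rough sketch of pulling these values straight from the repository database with pyodbc; the DSN, the task_id filter, and the exact view/column names are assumptions and may vary by PowerCenter version:

    import pyodbc

    conn = pyodbc.connect("DSN=infa_repo")  # assumed ODBC DSN for the repository DB
    cur = conn.cursor()

    session_task_id = 123           # assumed task_id of the session
    session_name = "s_my_session"   # assumed session name

    # 1) Commit interval for a session task (attr_id = 14, as noted above;
    #    filtering by task_id is an assumption about the schema).
    cur.execute(
        "SELECT attr_value FROM opb_task_attr WHERE attr_id = 14 AND task_id = ?",
        (session_task_id,),
    )
    commit_interval = cur.fetchone()[0]
    print("commit interval:", commit_interval)

    # 3) Per-target success/reject counts for a session run (column names such
    #    as successful_rows/failed_rows may differ in your repository version).
    cur.execute(
        "SELECT table_name, successful_rows, failed_rows "
        "FROM rep_sess_tbl_log WHERE session_name = ?",
        (session_name,),
    )
    for table_name, ok_rows, bad_rows in cur.fetchall():
        print(table_name, ok_rows, bad_rows)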