I am facing an issue regarding loading into a partitioned Oracle target table.
We have 2 sessions that use the same table in Oracle as the target:
a. INSERT data into Partition1
b. UPDATE data in Partition2
We are trying to achieve parallelism in the workflow; more partitions and sessions will be created for different data, all going into the same table but into different partitions.
Currently, when we run both these sessions in parallel, the UPDATE session runs successfully, but the INSERT session fails with a NOWAIT error.
NOTE: both sessions are loading data into different partitions.
We moved the mapping logic into 2 different stored procedures (one does the INSERT, the other the UPDATE), and they run in parallel without any locking when executed directly in the database.
We also tried specifying the partition name in the target override, but with the same result.
Can you advise what alternatives we have in order to achieve parallelism into the same target table from Informatica?
Thanks in advance
Using Impala, I noticed a deterioration in performance when I perform truncate and insert operations several times on internal tables.
The question is: can refreshing the tables avoid the problem?
So far I have used refresh only for external tables, every time I copied files to HDFS to be loaded into those tables.
Many thanks in advance!
Moreno
You can use compute stats instead of refresh.
Refresh is normally used when you add a data file or change something in the table metadata, like adding a column or partition or changing a column; it quickly reloads the metadata. There is another related command, invalidate metadata, but it is more expensive than refresh and forces Impala to reload the metadata the next time the table is referenced in a query.
compute stats - this computes statistics for the table or its columns when around 30% of the data has changed. It is an expensive operation, but effective when you do frequent truncate-and-load cycles.
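For reference, this is roughly what those commands look like in impala-shell, using a placeholder table name my_table:

-- Reload metadata for one table after adding files or changing partitions/columns
REFRESH my_table;
-- Heavier alternative: drop the cached metadata and reload it on the next query
INVALIDATE METADATA my_table;
-- Recompute table and column statistics after a large (roughly 30%) data change,
-- e.g. after a truncate-and-load cycle
COMPUTE STATS my_table;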
I have a GCP DataFlow pipeline configured with a select SQL query that selects specific rows from a Postgres table and then inserts these rows automatically into the BigQuery dataset. This pipeline is configured to run daily at 12am UTC.
When the pipeline initiates a job, it runs successfully and copies the desired rows. However, when the next job runs, it copies the same set of rows again into the BigQuery table, hence resulting in data duplication.
I wanted to know if there is a way to truncate the BigQuery dataset table before the pipeline runs. It seems like a common problem, so I am looking for an easy solution that doesn't require going into a custom DataFlow template.
BigQueryIO has an option called WriteDisposition, where you can use WRITE_TRUNCATE.
From the BigQueryIO documentation, WRITE_TRUNCATE means:
Specifies that write should replace a table.
The replacement may occur in multiple steps - for instance by first removing the existing table, then creating a replacement, then filling it in. This is not an atomic operation, and external programs may see the table in any of these intermediate steps.
If your use case cannot afford the table being unavailable during the operation, a common pattern is moving the data to a secondary / staging table, and then using atomic operations on BigQuery to replace the original table (e.g., using CREATE OR REPLACE TABLE).
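As a rough sketch of that staging pattern, with placeholder dataset and table names (the pipeline would load into the staging table, and a scheduled query or post-step would then run the swap):

-- Atomically replace the serving table with the freshly loaded staging data
CREATE OR REPLACE TABLE mydataset.orders AS
SELECT * FROM mydataset.orders_staging;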
I have an order table in the OLTP system.
Each order record has an OrderStatus field.
When an end user creates an order, the OrderStatus field is set to "Open".
When somebody cancels the order, the OrderStatus field is set to "Canceled".
When order processing is finished (the order is transformed into an invoice), the OrderStatus field is set to "Close".
There are more than one hundred million records in the table in the OLTP system.
I want to design and populate a data warehouse and data marts on the HDFS layer.
In order to build the data marts, I need to import the whole order table to HDFS and then reflect changes to the table continuously.
First, I can import the whole table into HDFS in the initial load by using Sqoop. It may take a long time, but I will do this only once.
When an order record is updated or a new order record is entered, I need to reflect the change in HDFS. How can I achieve this in HDFS for such a big transaction table?
Thanks
One of the easier ways is to work with database triggers in your OLTP source DB: every time an update happens, use the trigger to push an update event to your Hadoop environment.
On the other hand (this depends on the requirements of your data users), it might be enough to reload the whole data dump every night.
Also, if there is some kind of last-changed timestamp, it might be possible to load only the newest data and do some kind of delta check.
This all depends on your data structure, your requirements, and the resources at hand.
There are several other ways to do this, but usually those involve messaging, development, and new servers, and I suppose in your case that infrastructure or those resources are not available.
EDIT
Since you have a last changed date, you might be able to pull the data with a statement like
SELECT columns FROM table WHERE lastchangedate > (now - 24 hours)
or whatever your interval for loading might be.
Then process the data with Sqoop, ETL tools, or the like. If a record is already available in your Hadoop environment, you want to UPDATE it. If it is not available, INSERT it with your appropriate mechanism. This is sometimes also called UPSERTING; a sketch follows below.
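For illustration, such an upsert could look roughly like the following, assuming the target is a Hive ACID (transactional) table and the daily delta has been landed in a staging table; the table and column names (orders, orders_delta, order_id, orderstatus, lastchangedate) are placeholders:

-- Merge the newly extracted delta into the full order table on Hadoop
MERGE INTO orders t
USING orders_delta d
ON t.order_id = d.order_id
WHEN MATCHED THEN
  UPDATE SET orderstatus = d.orderstatus, lastchangedate = d.lastchangedate
WHEN NOT MATCHED THEN
  -- Hive's MERGE INSERT takes values for every column of the target table
  INSERT VALUES (d.order_id, d.orderstatus, d.lastchangedate);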
I have been exploring WSO2 CEP for last couple of days.
I am considering a scenario where a single lookup table could be used in multiple execution plans. As far as I know, the only way to store all that data is an event table.
My questions are:
Can I load an event table once (maybe by one execution plan) and share that table with other execution plans?
If the answer to Q1 is NO, then there will be multiple copies of the same data stored in different execution plans, right? Is there any way to reduce this space utilization?
If an event table is not the correct solution, what are the other options?
Thanks in Advance,
-Obaid
Event tables would work in your scenario. However, you might need to use an RDBMS EventTable or a Hazelcast EventTable instead of in-memory event tables. With them, you can share a single table's data among multiple execution plans.
If you want your data to be preserved even after server shutdown, you should use RDBMS EventTables (with these you can also access your table data using the respective DB browsers, e.g., the H2 console, MySQL Workbench, etc.). If you just want to share a single event table with multiple execution plans at runtime, you can go ahead with a Hazelcast EventTable.
I have an application that requires me to pull certain information from DB#1 and push it to DB#2 every time a certain entry in a table from DB#1 is updated. The polling rate doesn't need to be extremely fast, but it probably shouldn't be any slower than 1 second.
I was planning on writing a small service using the C++ Connector library, but I am worried about putting too much load on DB#1. Is there a more efficient way of doing this, such as built-in functionality within a SQL script?
There are many methods to accomplish this, so other factors you prefer may end up driving the approach.
If the SQL Server databases are on the same server instance:
A trigger on the DB1 tables that pushes to the DB2 tables
A stored procedure (in DB1 or DB2) that uses MERGE to identify changes and sync them to DB2, with a SQL Agent job calling the procedure on your schedule (see the sketch below)
Enable Change Tracking on the database and the desired tables, then use a stored procedure + SQL job to send changes without running queries against the source tables
If on different instances or servers (can also work if on same instance though):
An SSIS package to identify changes and push them to DB2 (bonus: can work with Change Data Capture)
Merge Replication to synchronize changes
AlwaysOn Availability Groups to synchronize entire dbs
Microsoft Sync Framework
Knowing nothing about your preferences or comfort levels, I would probably start with Merge Replication - it can be a bit tricky and tedious to set up, but it performs very well.
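As a rough sketch of the MERGE-based stored procedure option, assuming both databases are on the same instance so three-part names resolve, and using placeholder table and column names:

-- Called on a schedule (e.g. every second) by a SQL Agent job
CREATE PROCEDURE dbo.SyncOrdersToDB2
AS
BEGIN
    SET NOCOUNT ON;
    MERGE DB2.dbo.Orders AS target
    USING DB1.dbo.Orders AS source
        ON target.OrderID = source.OrderID
    WHEN MATCHED AND target.OrderStatus <> source.OrderStatus THEN
        UPDATE SET target.OrderStatus = source.OrderStatus
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (OrderID, OrderStatus)
        VALUES (source.OrderID, source.OrderStatus);
END;

Note that running a MERGE like this every second rescans the source table, so at that polling rate Change Tracking or replication will usually put less load on DB1.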
You can create a trigger in DB1 and a database link between DB1 and DB2. That way the trigger fires natively within DB1 and transfers the data directly to DB2.
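A minimal sketch of that idea, assuming an Oracle-style database link named db2_link and a staging table orders_changes on DB2 (all object names here are placeholders):

CREATE OR REPLACE TRIGGER trg_push_orders
AFTER INSERT OR UPDATE ON orders
FOR EACH ROW
BEGIN
    -- Ship the changed row to DB2 over the database link
    INSERT INTO orders_changes@db2_link (order_id, order_status, changed_at)
    VALUES (:NEW.order_id, :NEW.order_status, SYSDATE);
END;

Keep in mind that each write on DB1 then becomes a distributed transaction, so an unavailable DB2 or link will make writes on DB1 fail or wait; that is the usual trade-off with trigger-based replication.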