I need to sync data from 28.02.2022 to 1.04.2020 in descending order. How can I achieve this in Informatica without manually passing the dates?
You can simply add a Sorter transformation on the date port before the target to write the data in the required order. But if your target is a table, this ordering isn't meaningful, because in a DB the data can be stored in any order (unless the target is MS SQL Server).
If your target is a file, the ordering will be preserved.
So your mapping will look like this -
SQ--> EXP--...-->SRT_keys --> Target
If you want to read the data in a particular order, you can add the Sorter right after the SQ so the data is processed in that order.
So your mapping will look like this -
SQ-->SRT_keys--> EXP--... --> Target
Currently I'm loading data from Google Storage to stage_table_orders using WRITE_APPEND. Since this loads both new and existing orders, there can be cases where the same order has more than one version; the field etl_timestamp tells which row is the most up to date.
then I WRITE_TRUNCATE my production_table_orders with query like:
SELECT ...
FROM (
  SELECT *, ROW_NUMBER() OVER
    (PARTITION BY date_purchased, orderid ORDER BY etl_timestamp DESC) AS rn
  FROM `warehouse.stage_table_orders`
)
WHERE rn = 1
Then the production_table_orders always contains the most updated version of each order.
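The keep-latest-version logic of that ROW_NUMBER() query can be sketched in plain Python (illustrative only; the field names mirror the query above):

```python
def latest_per_order(rows):
    """Keep only the most recent version of each order, mirroring the
    ROW_NUMBER() OVER (PARTITION BY date_purchased, orderid
    ORDER BY etl_timestamp DESC) ... WHERE rn = 1 pattern."""
    best = {}
    for row in rows:
        key = (row["date_purchased"], row["orderid"])
        # A later etl_timestamp wins; ties keep the first-seen row.
        if key not in best or row["etl_timestamp"] > best[key]["etl_timestamp"]:
            best[key] = row
    return list(best.values())

stage = [
    {"orderid": 1, "date_purchased": "2022-01-01", "etl_timestamp": 100, "status": "new"},
    {"orderid": 1, "date_purchased": "2022-01-01", "etl_timestamp": 200, "status": "shipped"},
    {"orderid": 2, "date_purchased": "2022-01-02", "etl_timestamp": 150, "status": "new"},
]
production = latest_per_order(stage)
```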
This process is supposed to run every 3 minutes.
I'm wondering if this is the best practice.
I have around 20M rows. It seems not smart to WRITE_TRUNCATE 20M rows every 3 minutes.
Suggestion?
We are doing the same. To help improve performance, though, try partitioning the table by date_purchased and clustering by orderid.
Use a CTAS statement (to the table itself), as you cannot add partitioning after the fact.
EDIT: use 2 tables and MERGE
Depending on your particular use case, i.e. the number of fields that could differ between the old and new versions, you could use 2 tables, e.g. stage_table_orders for the imported records and final_table_orders as the destination table, and do a MERGE like so:
MERGE final_table_orders F
USING stage_table_orders S
ON F.orderid = S.orderid AND
F.date_purchased = S.date_purchased
WHEN MATCHED THEN
UPDATE SET field_that_change = S.field_that_change
WHEN NOT MATCHED THEN
INSERT (field1, field2, ...) VALUES(S.field1, S.field2, ...)
Pro: efficient if few rows are "upserted" rather than millions (although not tested), and partition pruning should work.
Con: you have to explicitly list the fields in the UPDATE and INSERT clauses. A one-time effort if the schema is pretty much fixed.
There are many ways to de-duplicate and there is no one-size-fits-all. Search on SO for similar requests using ARRAY_AGG, or EXISTS with DELETE, or UNION ALL, ... Try them out and see which performs better for YOUR dataset.
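For illustration, the upsert semantics of that MERGE can be sketched in Python with an in-memory dict index (field names such as field_that_change are placeholders from the example above):

```python
def merge_orders(final, stage):
    """Upsert stage rows into final, keyed on (orderid, date_purchased),
    mimicking MERGE ... WHEN MATCHED UPDATE / WHEN NOT MATCHED INSERT."""
    index = {(r["orderid"], r["date_purchased"]): r for r in final}
    for s in stage:
        key = (s["orderid"], s["date_purchased"])
        if key in index:
            # WHEN MATCHED THEN UPDATE
            index[key]["field_that_change"] = s["field_that_change"]
        else:
            # WHEN NOT MATCHED THEN INSERT
            final.append(dict(s))
            index[key] = final[-1]
    return final

final = [{"orderid": 1, "date_purchased": "2022-01-01", "field_that_change": "pending"}]
stage = [
    {"orderid": 1, "date_purchased": "2022-01-01", "field_that_change": "shipped"},
    {"orderid": 2, "date_purchased": "2022-01-02", "field_that_change": "pending"},
]
merge_orders(final, stage)
```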
I have an existing HANA warehouse which was built without create/update timestamps. I need to generate a number of nightly batch delta files to send to another platform. My problem is how to detect which records are new or changed so that I can capture those records within the replication process.
Is there a way to use HANA's built-in features to detect new/changed records?
SAP HANA does not provide a general change data capture interface for tables (up to current version HANA 2 SPS 02).
That means, to detect "changed records since a given point in time" some other approach has to be taken.
Depending on the information in the tables different options can be used:
if a table explicitly contains a reference to the last change time, this can be used
if a table has guaranteed update characteristics (e.g. no in-place updates and monotone ID values), this can be used, e.g. read all records where the ID is larger than the last processed ID
if the table does not provide intrinsic information about change time, then one could maintain a copy of the table that contains only the records processed so far. This copy can then be used to compare against the current table and compute the difference. SAP HANA's Smart Data Integration (SDI) flowgraphs support this approach.
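As a sketch of the second option above (high-water-mark extraction on a monotone ID column, assuming no in-place updates), with hypothetical field names:

```python
def extract_delta(rows, last_processed_id):
    """Return rows with ID greater than the stored high-water mark, plus the
    new mark to persist for the next nightly run. Assumes IDs are
    monotonically increasing and rows are never updated in place."""
    delta = [r for r in rows if r["id"] > last_processed_id]
    new_mark = max((r["id"] for r in delta), default=last_processed_id)
    return delta, new_mark

rows = [{"id": 1}, {"id": 2}, {"id": 3}]
delta, mark = extract_delta(rows, last_processed_id=1)
```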
In my experience, efforts to "save time and money" on this seemingly simple problem of a delta load usually turn out to be more complex, time-consuming and expensive than using the corresponding features of ETL tools.
It is possible to create a log table and organize its columns according to your needs, so that by creating triggers on your database tables you can write a log record with timestamp values. Then you can query the log table to determine which records were inserted, updated or deleted in your source tables.
For example, the following is from one of my test trigger codes:
CREATE TRIGGER "A00077387"."SALARY_A_UPD"
AFTER UPDATE ON "A00077387"."SALARY"
REFERENCING OLD ROW MYOLDROW, NEW ROW MYNEWROW
FOR EACH ROW
BEGIN
    INSERT INTO SalaryLog (Employee, Salary, Operation, DateTime)
    VALUES (:mynewrow.Employee, :mynewrow.Salary, 'U', CURRENT_DATE);
END;
You can create AFTER INSERT and AFTER DELETE triggers as well, similar to the AFTER UPDATE one.
You can organize your log table so that you can track more than one table if you wish, just by keeping the table name, PK fields and values, operation type, timestamp values, etc.
But it is better and easier to use separate log tables for each source table.
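As an illustrative sketch (not HANA-specific), this is how a trigger-populated log table could be interpreted downstream to classify rows changed since the last run; the field names pk, op and ts are made up:

```python
def changes_since(log_rows, since):
    """Classify primary keys from a trigger-populated log table by their
    latest operation after `since` ('I' insert, 'U' update, 'D' delete).
    Later log entries overwrite earlier ones for the same key."""
    latest = {}
    for r in sorted(log_rows, key=lambda r: r["ts"]):
        if r["ts"] > since:
            latest[r["pk"]] = r["op"]
    return latest

log = [
    {"pk": 101, "op": "I", "ts": 1},
    {"pk": 101, "op": "U", "ts": 5},   # inserted then updated -> reported as 'U'
    {"pk": 102, "op": "D", "ts": 6},
    {"pk": 103, "op": "I", "ts": 0},   # at or before the last run, ignored
]
delta = changes_since(log, since=0)
```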
I have a table in a database which stores items. Each item has a unique ID, which the DB generates upon insertion (auto-increment).
A user may perform a specific task that will add X items to the database, however my program (C++ server application using MySQL connector) should return the IDs that the database generated right away. For example, if I add 6 items, the server must return 6 new unique IDs to the client.
What is the fastest/cleanest way to do such a thing? So far I have been doing an INSERT followed by a SELECT for each new item, OR an INSERT followed by last_insert_id; however, if there are 50 items to add it takes at least a few seconds, which is not good at all for user experience.
sql_task.query("INSERT INTO `ItemDB` (`ItemName`, `Type`, `Time`) VALUES ('%s', '%d', '%d')", strName.c_str(), uiType, uiTime);
Getting the ID:
uint64_t item_id { sql_task.last_id() }; //This calls mysql_insert_id
I believe you need to rethink your design slightly. Let's use the analogy of a sales order. With a sales order (or invoice #) the user gets an invoice number (auto_incr) as well as multiple line item numbers (also auto_inc).
The sales order and all of its line items are selected for insert (from the GUI) and the inserts are performed. First, the sales order row is inserted and its id is saved in a variable for the subsequent calls that insert the line items. The line items are then just inserted, without immediately returning their auto_inc id values; the application is merely returned the sales order number at the end. How your app uses that sales order number in subsequent calls is up to you, but it does not need to retrieve all the X or 50 row ids at once, as it has the sales order number saved somewhere. Let's call that sales order number XYZ.
When you actually need the information, an example call could look like
select lineItemId
from lineItems
where salesOrderNumber=XYZ
order by lineItemId
You need to remember that in a multi-user system there is no guarantee of receiving a contiguous block of numbers. Nor should it matter to you, as they are all attached to the correct sales order number.
Again, the above is just an analogy, used for illustration purposes.
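The pattern can be sketched with Python's built-in sqlite3 standing in for MySQL (table and column names follow the analogy above; lastrowid plays the role of last_insert_id):

```python
import sqlite3

# Insert the parent, keep its auto-increment id, bulk-insert the children,
# and fetch the children's ids later by the parent id instead of one-by-one.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE salesOrders (id INTEGER PRIMARY KEY AUTOINCREMENT, customer TEXT);
    CREATE TABLE lineItems  (lineItemId INTEGER PRIMARY KEY AUTOINCREMENT,
                             salesOrderNumber INTEGER, itemName TEXT);
""")

cur = conn.execute("INSERT INTO salesOrders (customer) VALUES ('acme')")
order_id = cur.lastrowid  # the one id the application actually needs to keep

conn.executemany(
    "INSERT INTO lineItems (salesOrderNumber, itemName) VALUES (?, ?)",
    [(order_id, name) for name in ("bolt", "nut", "washer")],
)

# Only when the line-item ids are finally needed:
ids = [r[0] for r in conn.execute(
    "SELECT lineItemId FROM lineItems WHERE salesOrderNumber = ? ORDER BY lineItemId",
    (order_id,))]
```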
That's a common but hard-to-solve problem. I'm unsure about MySQL, but PostgreSQL uses sequences to generate automatic ids. Inserting frameworks (object relational mappers) use that when they expect to insert many values: they query the sequence directly for a bunch of IDs and then insert the new rows using those already-known IDs. That way, there is no need for an additional query after each insert to get the ID.
The downside is that the relationship between ID and insertion time can be non-monotonic when different writers intermix their inserts. That is not a problem for the database, but some (poorly written?) programs could expect it to be.
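A minimal sketch of that block-allocation idea, with the database round trip stubbed out by a plain function (the block size and starting values are invented):

```python
class SequenceAllocator:
    """Hands out ids from a locally cached block, fetching a new block start
    from the database sequence only when the cache is exhausted: one round
    trip per block instead of one per inserted row."""
    def __init__(self, fetch_next_block_start, block_size=50):
        self._fetch = fetch_next_block_start
        self._size = block_size
        self._next = 0
        self._remaining = 0

    def next_id(self):
        if self._remaining == 0:
            self._next = self._fetch()
            self._remaining = self._size
        self._remaining -= 1
        nid = self._next
        self._next += 1
        return nid

calls = []
def fetch_block_start():
    # Stand-in for e.g. querying a sequence defined with INCREMENT BY 50.
    calls.append(1)
    return 1000 + (len(calls) - 1) * 50

alloc = SequenceAllocator(fetch_block_start, block_size=50)
ids = [alloc.next_id() for _ in range(60)]
```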
As your ID is auto-increment, you can do just two extra SELECT queries - before and after the INSERT queries:
SELECT AUTO_INCREMENT FROM information_schema.tables WHERE table_name = 'dbTable' AND table_schema = DATABASE();
--
-- INSERT INTO dbTable... (one or many, does not matter);
--
SELECT LAST_INSERT_ID() AS lastID;
This will give you the sequence between the first and last inserted IDs. Then you can easily calculate how many there are.
If I run a CREATE EXTERNAL TABLE cetasTable AS SELECT command then run:
EXPLAIN
select * from cetasTable
I see in the distributed query plan:
<operation_cost cost="4231.099968" accumulative_cost="4231.099968" average_rowsize="2056" output_rows="428735" />
It seems to know the correct row count, however, if I look there are no statistics created on that table as this query returns zero rows:
select * from sys.stats where object_id = object_id('cetasTable')
If I already have files in blob storage and I run a CREATE EXTERNAL TABLE cetTable command then run:
EXPLAIN
select * from cetTable
The distributed query plan shows SQL DW thinks there are only 1000 rows in the external table:
<operation_cost cost="4.512" accumulative_cost="4.512" average_rowsize="940" output_rows="1000" />
Of course I can create statistics to ensure SQL DW knows the right row count when it creates the distributed query plan. But can someone explain how it knows the correct row count some of the time and where that correct row count is stored?
What you are seeing is the difference between a table created using CxTAS (CTAS, CETAS or CRTAS) and CREATE TABLE.
When you run CREATE TABLE, the row count and page count values are fixed, as the table is empty. If memory serves, the fixed values are 1000 rows and 100 pages. When you create a table with CTAS they are not fixed. The actual values are known to the CTAS command, as it has just created and populated the table in a single command. Consequently, the metadata correctly reflects the table SIZE when a CxTAS is used. This is good. The APS / SQLDW cost-based optimizer can immediately make better estimations for MPP plan generation based on table SIZE when a table has been created via CxTAS as opposed to CREATE TABLE.
Having an accurate understanding of table size is important.
Imagine you have a table created using CREATE TABLE, and then 1 billion rows are inserted into said table with INSERT. The shell database still thinks that the table has 1000 rows and 100 pages. However, this is clearly not the case. This is because the table size attributes are not automatically updated at that time.
Now imagine that a query is fired that requires data movement on this table. Things may begin to go awry. You are now more likely to see the engine make poor MPP plan choices (typically using BROADCAST rather than SHUFFLE) as it does not understand the table size amongst other things.
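As a purely illustrative toy model (the threshold is invented, not SQL DW's actual costing), this shows how a stale 1000-row estimate can tip the choice toward BROADCAST:

```python
def choose_move_strategy(estimated_rows, broadcast_threshold=100_000):
    """Toy cost-based choice: broadcasting a 'small' table to every node is
    cheap, shuffling is safer for large ones. Threshold is invented."""
    return "BROADCAST" if estimated_rows < broadcast_threshold else "SHUFFLE"

# Stale metadata: the table still reports the CREATE TABLE default of 1000
# rows, so the planner broadcasts - a poor plan if 1 billion rows exist.
plan_stale = choose_move_strategy(1_000)
# With updated statistics the engine sees the true size and shuffles instead.
plan_fresh = choose_move_strategy(1_000_000_000)
```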
What can you do to improve this?
You create at least one column level statistics object per table. Generally speaking you will create statistics objects on all columns used in JOINS, GROUP BYs, WHEREs and ORDER BYs in your queries. I will explain the underlying process for statistics generation in a moment. I just want to emphasise that the call to action here is to ensure that you create and maintain your statistics objects.
When CREATE STATISTICS is executed for a column three events actually occur.
1) Table level information is updated on the CONTROL node
2) Column level statistics object is created on every distribution on the COMPUTE nodes
3) Column level statistics object is created and updated on the CONTROL node
1) Table level information is updated on the CONTROL node
The first step is to update the table level information. To do this APS / SQLDW executes DBCC SHOW_STATISTICS (table_name) WITH STAT_STREAM against every physical distribution; merging the results and storing them in the catalog metadata of the shell database. Row count is held on sys.partitions and page count is held on sys.allocation_units. Sys.partitions is visible to you in both SQLDW and APS. However, sys.allocation_units is not visible to the end user at this time. I referenced the location for those familiar with the internals of SQL Server for information and context.
At the end of this stage the metadata held in the shell database on the CONTROL node has been updated for both row count and page count. There is now no difference between a table created by CREATE TABLE and a CTAS - both know the size.
2) Column level statistics object is created on every distribution on the COMPUTE nodes
The statistics object must be created in every distribution on every COMPUTE node. Creating a statistics object produces important, detailed statistical data for the column (notably the histogram and the density vector).
This information is used by APS and SQLDW for generating distribution level SMP plans. SMP plans are used by APS / SQLDW in the PHYSICAL layer only. Therefore, at this point the statistical data is not in a location that can be used for generating MPP plans. The information is distributed and not accessible in a timely fashion for cost based optimisation. Therefore a third step is necessary...
3) Column level statistics object is created and updated on the CONTROL node
Once the data is created PHYSICALLY on the distributions in the COMPUTE layer it must be brought together and held LOGICALLY to facilitate MPP plan cost based optimisation. The shell database on the CONTROL node also creates a statistics object. This is a LOGICAL representation of the statistics object.
However, the shell database stat does not yet reflect the column level statistical information held PHYSICALLY in the distributions on the COMPUTE nodes. Consequently, the statistics object in the shell database on the CONTROL node needs to be UPDATED immediately after it has been created.
DBCC SHOW_STATISTICS (table_name, stat_name) WITH STAT_STREAM is used to do this.
Notice that the command has a second parameter. This changes the result set; providing APS / SQLDW with all the information required to build a LOGICAL view of the statistics object for that column.
I hope this goes some way to explaining what you were seeing but also how statistics are created and why they are important for Azure SQL DW and for APS.
I have a workflow which writes data from a table into a flat file. It works just fine, but I want to insert a blank line between records. How can this be achieved? Any pointers?
Here, you can create 2 target instances: one with the proper data, and in the other instance pass a blank line. Set Merge Type to "Concurrent Merge" in the session properties.
Multiple possibilities -
You can prepare the appropriate dataset in a relational table and afterwards dump the data from it into a flat file. While preparing that data set, you can insert blank rows into the relational target.
Send a blank line to a separate target file (based on some business condition, using a Router or something similar); after that you can use the merge files option (in the session config) to get that data into a single file.
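If post-processing the file outside the session is acceptable, the blank-line interleaving itself is trivial; a Python sketch (the file path and record format are made up):

```python
import os
import tempfile

def write_with_blank_lines(records, path):
    """Write each record on its own line with a blank line between records,
    as an alternative to producing the gaps inside the session itself."""
    with open(path, "w") as f:
        f.write("\n\n".join(records) + "\n")

path = os.path.join(tempfile.mkdtemp(), "orders.txt")
write_with_blank_lines(["1001|widget|5", "1002|gadget|2"], path)
content = open(path).read()
```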