Background: I am new to Informatica. Version: Informatica PowerCenter Express 9.6.1 HotFix 2.
In my ETL project I have several mappings that load different dimension and fact tables in a data mart. The ETL runs daily, and one requirement is to add an audit key as a column to each of these tables. The audit key is an integer generated from an audit table (the next value of the audit key column, which is its primary key), so each day the audit key increases by 1. After each ETL load, all new or updated rows in all tables (dimension/fact) will carry this audit key in a column. The purpose is to be able to trace when and how each row was inserted or updated.
Now the question is: how do I generate such a key and pass it on to all the mappings? The key should be the next value of the auditkey column of the audit table.
You could build a mapplet that generates and maintains the key you want and use it in all your workflows.
If you have an RDBMS source, I would suggest creating an Oracle sequence in the DB and an Oracle function that returns its next value.
Call the newly created Oracle function in the SQL Override and use the resulting sequence number in all the mappings.
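A minimal sketch of that approach (object names are placeholders, not from your project):

-- Hypothetical Oracle sequence and wrapper function for the audit key.
CREATE SEQUENCE audit_key_seq START WITH 1 INCREMENT BY 1;

CREATE OR REPLACE FUNCTION get_next_audit_key RETURN NUMBER IS
    v_key NUMBER;
BEGIN
    SELECT audit_key_seq.NEXTVAL INTO v_key FROM dual;
    RETURN v_key;
END;
/

You would then reference get_next_audit_key in the SQL Override of each mapping's source qualifier.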
I'm trying to set up a PostgreSQL migration using DMS with S3 as the target. But after running it I noticed that some tables were missing some columns.
After checking the logs I noticed this message:
Column 'column_name' was removed from table definition 'schema.table': the column data type is LOB and the table has no primary key or unique index
In the migration task settings I tried increasing the LOB limit by setting the option Maximum LOB size to 2000000, but I am still getting the same result.
Does anyone know a workaround for this problem?
I guess the problem is that you do not have a primary key on your table.
From AWS documentation:
Currently, a table must have a primary key for AWS DMS to capture LOB changes. If a table that contains LOBs doesn't have a primary key, there are several actions you can take to capture LOB changes:
Add a primary key to the table. This can be as simple as adding an ID column and populating it with a sequence using a trigger.
Create a materialized view of the table that includes a system-generated ID as the primary key and migrate the materialized view rather than the table.
Create a logical standby, add a primary key to the table, and migrate from the logical standby.
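For the first option, a minimal sketch for a PostgreSQL source could look like this (table and column names are placeholders; with PostgreSQL 10+ an identity column avoids the need for an explicit trigger):

-- Hypothetical table/column names: add a surrogate key so DMS can capture LOB changes.
ALTER TABLE my_schema.my_table
    ADD COLUMN row_id BIGINT GENERATED ALWAYS AS IDENTITY;
ALTER TABLE my_schema.my_table
    ADD CONSTRAINT my_table_pk PRIMARY KEY (row_id);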
It is also important that the primary key be of a simple type, not a LOB:
In FULL LOB or LIMITED LOB mode, AWS DMS doesn't support replication of primary keys that are LOB data types.
Right now I fetch the columns and data types of BQ tables via the command below:
SELECT COLUMN_NAME, DATA_TYPE
FROM `Dataset`.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS
WHERE table_name="User"
But if I drop a column using the command ALTER TABLE User DROP COLUMN blabla, the column blabla is not actually deleted for 7 days (TTL), based on the official documentation.
If I use the above command, the column is still present in the schema as well as in Dataset.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS.
It is just that I can no longer insert data into that column or view it in the GCP console. This inconsistency really causes an issue.
I want to write a bash script to monitor schema changes and perform some operations based on them, so I need more visibility into the BigQuery table schema. The least I need is for Dataset.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS to store a flag column that indicates deleted or TTL:7days.
My questions are:
How can I fetch the correct schema in Spanner that reflects the recently deleted column?
If the column is not actually deleted, is there any way to easily restore it?
If you want to fetch the recently deleted column you can try searching through Cloud Logging. I'm not sure what tools Spanner supports but if you want to use Bash you can use gcloud to fetch logs. Though it will be difficult to parse the output and get the information you want.
The command below fetches the logs for google.cloud.bigquery.v2.JobService.InsertJob, since an ALTER TABLE is considered an InsertJob, and filters them based on the actual query text where it says drop. The regex I used is not strict (for the sake of the example); I suggest making it stricter.
gcloud logging read 'protoPayload.methodName="google.cloud.bigquery.v2.JobService.InsertJob" AND protoPayload.metadata.jobChange.job.jobConfig.queryConfig.query=~"Alter table .*drop.*"'
Sample snippet from the command above (the column PADDING was dropped, based on the query):
If you have options other than Bash, I suggest creating a BigQuery sink for your logs; you can then run queries there and get this information. You can also use client libraries like Python, NodeJS, etc. to either query the sink or query Cloud Logging directly.
As per this SO answer, you can use BigQuery's time travel feature to query the deleted column. That answer also explains BigQuery's behavior of retaining a dropped column for 7 days, along with a workaround to delete the column instantly. See the actual query used to retrieve the deleted column and the column-deletion workaround at the link provided.
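For illustration only (dataset and column names are placeholders, not taken from the linked answer), the two techniques look roughly like this:

-- Query the table as it looked before the column was dropped (time travel).
SELECT *
FROM `my_dataset.User` FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);

-- Workaround to remove a column immediately: rewrite the table without it.
CREATE OR REPLACE TABLE `my_dataset.User` AS
SELECT * EXCEPT (blabla)
FROM `my_dataset.User`;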
In SQL Server, we can create an index like this. How do we create an index after the table already exists? What is the syntax for creating a clustered index in BigQuery?
CREATE INDEX abcd ON `abcd.xxx.xxx`(columnname )
In BigQuery, we can create a table like below. But how do we add partitioning and clustering to an existing table?
CREATE TABLE rep_sales.orders_tmp PARTITION BY DATE(created_at) CLUSTER BY created_at AS SELECT * FROM rep_sales.orders
As @Sergey Geron mentioned in the comments, BigQuery doesn't support indexes. For more information, please refer to this doc.
An existing table cannot be partitioned in place, but you can create a new partitioned table and then load the data into it from the unpartitioned table.
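A minimal sketch, reusing the table names from your question (the new table name is a placeholder); you could also add a CLUSTER BY clause as in your own example:

-- Create a new partitioned copy of the existing, unpartitioned table.
CREATE TABLE rep_sales.orders_partitioned
PARTITION BY DATE(created_at)
AS SELECT * FROM rep_sales.orders;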
As for clustering of tables, BigQuery supports changing an existing non-clustered table to a clustered table and vice versa. You can also update the set of clustered columns of a clustered table. This method of updating the clustering column set is useful for tables that use continuous streaming inserts because those tables cannot be easily swapped by other methods.
You can change the clustering specification in the following ways:
Call the tables.update or tables.patch API method.
Call the bq command-line tool's bq update command with the --clustering_fields flag.
Note: When a table is converted from non-clustered to clustered or the clustered column set is changed, automatic re-clustering only works from that time onward. For example, a non-clustered 1 PB table that is converted to a clustered table using tables.update still has 1 PB of non-clustered data. Automatic re-clustering only applies to any new data committed to the table after the update.
I am in the process of migrating Oracle 12c to Azure SQL Data Warehouse, and I am currently creating the DDL for the Oracle tables.
My question is, how can I create a range partition by date in Azure SQL DW?
How do I convert this existing Oracle code to Azure SQL DW?
PARTITION BY RANGE ("LOG_DATE") INTERVAL (NUMTODSINTERVAL(1, 'DAY')) (PARTITION "PART_01" VALUES LESS THAN (TO_DATE(' 2018-10-02 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN')) SEGMENT CREATION IMMEDIATE
Appreciate any help from your end.
I understand this statement to move any date prior to 2018-10-02 into one partition, then dynamically create new partitions for each day as rows are received.
There is no direct equivalent of this syntax in Azure SQL Data Warehouse.
The technique that would appear to meet your need is dynamic partition management as described in the following documentation:
https://learn.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-tables-partition#table-partitioning-source-control
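As an illustration only (the table, columns, and distribution choice are placeholder assumptions, not taken from your DDL), partition boundaries in Azure SQL DW are declared up front and then extended over time, for example with a daily SPLIT step:

-- Hypothetical table: boundaries are listed explicitly rather than generated by INTERVAL.
CREATE TABLE dbo.my_log
(
    LOG_DATE DATE NOT NULL,
    LOG_TEXT NVARCHAR(4000)
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    PARTITION ( LOG_DATE RANGE RIGHT FOR VALUES ('2018-10-02') )
);

-- Daily maintenance step: add the next day's boundary
-- (for clustered columnstore tables the partition being split must be empty).
ALTER TABLE dbo.my_log SPLIT RANGE ('2018-10-03');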
I'm just starting to experiment with AWS Glue, and I've successfully pulled data from my Aurora MySQL environment into my PostgreSQL DB. When the crawler creates the data catalog entry for the table I'm experimenting with, all the columns are out of order, and when the job then creates the destination table, the columns are again out of order, I assume because it is created based on what the crawler generated. How can I make the table structure in the catalog match what's in the source DB?
You can simply open the table created by the crawler, click "Edit schema", and then click the number at the start of each row to change it; those numbers define the column order.