How to create a Range partition in Azure SQL Data Warehouse?

I am in the process of migrating Oracle 12c to Azure SQL Data Warehouse, and I am currently creating the DDLs for the Oracle tables.
My question is, how can I create a "Range partition" by date in Azure SQL DW?
How do I convert this existing Oracle code to Azure SQL DW?
PARTITION BY RANGE ("LOG_DATE") INTERVAL (NUMTODSINTERVAL(1, 'DAY'))
  (PARTITION "PART_01" VALUES LESS THAN
    (TO_DATE(' 2018-10-02 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN'))
    SEGMENT CREATION IMMEDIATE
Appreciate any help from your end.

I understand this statement to place any row dated prior to 2018-10-02 into one partition, and then to create new partitions dynamically for each day as rows are received.
There is no direct equivalent of this syntax in Azure SQL Data Warehouse.
The technique that would appear to meet your need is dynamic partition management as described in the following documentation:
https://learn.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-tables-partition#table-partitioning-source-control
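For reference, here is a minimal sketch of how a date-partitioned table is typically declared in Azure SQL DW (dedicated SQL pool), using RANGE RIGHT FOR VALUES; the table name, columns, distribution choice, and boundary dates are illustrative assumptions. Unlike Oracle interval partitioning, new daily boundaries are not created automatically and must be added explicitly, for example with ALTER TABLE ... SPLIT RANGE.
-- Illustrative sketch only: names and boundary values are assumptions
CREATE TABLE dbo.FACT_LOG
(
    LOG_ID   bigint NOT NULL,
    LOG_DATE date   NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(LOG_ID),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION ( LOG_DATE RANGE RIGHT FOR VALUES
        ('2018-10-02', '2018-10-03', '2018-10-04') )
);
-- Add the next day's boundary before loading data for that day
ALTER TABLE dbo.FACT_LOG SPLIT RANGE ('2018-10-05');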

Related

Data Warehouse and connection to Power Bi on AWS Structure

I work for a startup where I have created several dashboards in Power BI using tables that are stored in an AWS RDS that I connect to using MySQL. To create additional columns for visualizations, I created views of the tables in MySQL and used DAX to add some extra columns for the Power BI visualizations.
However, my team now wants to use the AWS structure and build a data lake to store the raw data and a data warehouse to store the transformed data. After researching, I believe I should create the additional columns in either Athena or Redshift. My question is, which option is best for our needs?
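For illustration only, the kind of derived-column logic described here can usually be expressed as a plain SQL view in either Athena or Redshift rather than in DAX; the schema, table, and column names below are made up for the sketch.
-- Hypothetical names; shows derived columns computed in SQL instead of DAX
CREATE VIEW analytics.orders_enriched AS
SELECT
    o.order_id,
    o.order_date,
    o.amount,
    o.amount * 0.16                    AS tax_amount,   -- example derived column
    DATE_TRUNC('month', o.order_date)  AS order_month   -- example derived column
FROM raw_data.orders o;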
I think the solution is to connect to the RDS using AWS Glue to perform the necessary transformations and deposit the transformed data in either Athena or Redshift. Then, we can connect to the chosen platform using Power BI. Please let me know if I am misunderstanding anything.
To give an approximate sense of the volume I'm handling, the fact tables receive about 10 thousand new records every month.
Thank you in advance!

gcp BigQuery for Dimensional Star Schema Data Warehouse build performance

Google states that BigQuery is intended for data warehouses that are largely append-only, with relatively few updates.
For a star-schema-based DWH with optional fact table attributes that may be updated and dimensions that are historized, is this viable, or do we need the Redshift-style approach of small staging tables holding the new or updated data, applied via an UPSERT query?
Is this type of approach possible in BigQuery using Spark?
spark.sql(""" MERGE INTO CUSTOMERS_AT_REST
USING CUST_DELTA
ON CUSTOMERS_AT_REST.col_key = CUST_DELTA.col_key
WHEN MATCHED THEN
UPDATE SET *
WHEN NOT MATCHED THEN
INSERT *
""")
This all works fine today against Delta tables on GCP Cloud Storage.
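For comparison, BigQuery's standard SQL supports MERGE natively, so a similar upsert can be run directly in BigQuery without Spark; note that BigQuery does not accept the Delta-style UPDATE SET * / INSERT * shorthand, so columns must be listed explicitly. The project, dataset, and column names below are placeholders.
-- Placeholder project/dataset/column names; BigQuery requires explicit column lists
MERGE `my-project.dwh.CUSTOMERS_AT_REST` T
USING `my-project.dwh.CUST_DELTA` S
ON T.col_key = S.col_key
WHEN MATCHED THEN
  UPDATE SET name = S.name, status = S.status
WHEN NOT MATCHED THEN
  INSERT (col_key, name, status) VALUES (S.col_key, S.name, S.status);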

how to create partition and cluster on an existing table in big query?

In SQL Server, we can create an index like this. How do we create the index after the table already exists? What is the syntax to create a clustered index in BigQuery?
CREATE INDEX abcd ON `abcd.xxx.xxx`(columnname )
In BigQuery, we can create a table like the one below. But how do we add partitioning and clustering to an existing table?
CREATE TABLE rep_sales.orders_tmp
PARTITION BY DATE(created_at)
CLUSTER BY created_at
AS SELECT * FROM rep_sales.orders
As @Sergey Geron mentioned in the comments, BigQuery doesn’t support indexes. For more information, please refer to this doc.
An existing table cannot be partitioned but you can create a new partitioned table and then load the data into it from the unpartitioned table.
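As a rough sketch of that approach, reusing the dataset and column names from the question (the new table name and the clustering column are just examples), a CREATE TABLE ... AS SELECT can build the partitioned and clustered copy in one step:
-- Builds a new partitioned and clustered table from the existing unpartitioned one
CREATE TABLE rep_sales.orders_partitioned
PARTITION BY DATE(created_at)
CLUSTER BY created_at
AS
SELECT * FROM rep_sales.orders;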
As for clustering of tables, BigQuery supports changing an existing non-clustered table to a clustered table and vice versa. You can also update the set of clustered columns of a clustered table. This method of updating the clustering column set is useful for tables that use continuous streaming inserts because those tables cannot be easily swapped by other methods.
You can change the clustering specification in the following ways:
Call the tables.update or tables.patch API method.
Call the bq command-line tool's bq update command with the --clustering_fields flag.
Note: When a table is converted from non-clustered to clustered or the clustered column set is changed, automatic re-clustering only works from that time onward. For example, a non-clustered 1 PB table that is converted to a clustered table using tables.update still has 1 PB of non-clustered data. Automatic re-clustering only applies to any new data committed to the table after the update.

Optimize data load from Azure Cosmos DB to Power BI

Currently we have a problem with load times when refreshing the report data from the DB, since it has too many records and it takes forever to load everything. The question is how I can load only the data from the last year to avoid this. As far as I can see, the Cosmos DB connector dialog lets me enter a SQL query, but I don't know how to write one for this type of non-relational database.
Power BI has an incremental refresh feature. You should be able to refresh the current year only.
If that still doesn’t meet expectations, I would look at a preview feature called Azure Synapse Link, which automatically pulls all Cosmos DB updates out into analytical storage that you can query much faster from Azure Synapse Analytics, in order to refresh Power BI faster.
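As a rough illustration of what that looks like (the account, database, key, and container names are placeholders), Synapse serverless SQL can read the Cosmos DB analytical store with OPENROWSET, and the result can then be used as a faster source for Power BI:
-- Placeholder account/database/key/container; reads the Cosmos DB analytical store
SELECT TOP 100 *
FROM OPENROWSET(
    'CosmosDB',
    'Account=myCosmosAccount;Database=myDatabase;Key=myAccountKey==',
    myContainer
) AS documents;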
Depending on the volume of the data, you will hit a number of issues. The first is that you may exceed your RU limit, slowing down the extraction of the data from Cosmos DB. The second is transforming the data from JSON into a structured format.
I would try to write a query to specify the fields and items that you need. That will reduce the time of processing and getting the data.
For SQL queries it will be something like
SELECT * FROM c WHERE c.partitionEntity = 'guid'
For more information on the CosmosDB SQL API syntax please see here to get you started.
You can use the query window in Azure to run the SQL commands, or Azure Storage Explorer to test the query, then move it to Power BI.
What is highly recommended is to extract the data into a place where it can be transformed into a structured format such as a table or CSV file.
For example use Azure Databricks to extract, then turn the JSON format into a table formatted object.
You also have the option of running Databricks notebook queries against Cosmos DB, or using Azure Databricks in its own instance. Another option would be to use the change feed with an Azure Function to send and shred the data into Blob Storage, and then query it from there using Power BI, Databricks, Azure SQL Database, etc.
In the Source of your Query, you can make a select based on the CosmosDB _ts system property, like:
Query ="SELECT * FROM XYZ AS t WHERE t._ts > 1609455599"
In this case, 1609455599 is the timestamp which corresponds to 31.12.2020, 23:59:59. So, only data from 2021 will be selected.

informatica powercenter express pass variable to multiple mappings

Background: I am new to Informatica. Informatica PowerCenter Express version: 9.6.1 HotFix 2
In my ETL project I have several mappings that load different dimension and fact tables in a data mart. The ETL will run daily, and one requirement is to add an audit key as a column to each of these tables. The audit key is an integer generated from an audit table (the next value of its audit key column, the primary key), so every day the audit key increases by 1. After each ETL load, all the new or updated rows in all tables (dimension/fact) will carry this audit key in a column. The purpose is to be able to trace when and how each row was inserted or updated.
Now the question is how to generate such a key and pass it on to all the mappings. The key should be the next value from the audit key column of the audit table.
You could build a mapplet that generates/maintains the key you want and use it in all your workflows
If you have an RDBMS source, I would suggest creating an Oracle sequence in the DB and an Oracle function that returns its next value.
Call the newly created Oracle function in the SQL override and use the returned sequence number in all the mappings.
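A minimal sketch of that idea, assuming an Oracle source; the sequence and function names are made up for illustration:
-- Illustrative names; align the sequence with the audit table's key column
CREATE SEQUENCE audit_key_seq START WITH 1 INCREMENT BY 1;

CREATE OR REPLACE FUNCTION get_next_audit_key RETURN NUMBER IS
  v_key NUMBER;
BEGIN
  SELECT audit_key_seq.NEXTVAL INTO v_key FROM dual;
  RETURN v_key;
END;
/

-- The function can then be called once per load, e.g. from the SQL override:
-- SELECT get_next_audit_key FROM dual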