Restoring a database on Azure SQL Data Warehouse

What variables determine the length of time it takes to restore an Azure SQL Data Warehouse database?
I am creating a new ADW database on a new logical server (in the same Azure region) using the Azure portal, with the source specified as the backup of another ADW database (which has ~100 TB of uncompressed data loaded into compressed columnar tables).
The Azure portal reports the status as "Deploying", but I am unclear whether this will be an operation taking minutes, hours, or days.
Is there any way to track the progress?

Restore time for Azure SQL Data Warehouse is primarily determined by two factors: the size of the database you are restoring, and the location you are restoring it into. The fastest restore will be into the paired region, the second fastest into the original region, and the slowest anywhere else.

Related

Data Fusion replication pipeline is not syncing data in Google Bigquery

Hi, we want to replicate data from MySQL (source) to Google BigQuery (destination). We adopted the method described in the Google docs using a Data Fusion replication pipeline, as mentioned in this link:
https://cloud.google.com/data-fusion/docs/tutorials/replicating-data/mysql-to-bigquery
Brief of what we are doing:
Enabling the binlog in MySQL for CDC (Change Data Capture)
Creating a replication pipeline in Data Fusion
Starting the pipeline and syncing the data
We were able to create the MySQL database on Compute Engine, enable the binlog for CDC, and grant the user all permissions necessary for the data replication pipeline in MySQL.
We successfully created a Data Fusion instance and were able to create a replication pipeline.
The replication pipeline is able to fetch our SQL database details, and the target BigQuery dataset is also set.
On starting the pipeline it tracks the changes successfully (insert, update and delete), and the table schema is also created automatically in BigQuery.
But the PROBLEM is that no data is getting transferred to the BigQuery table. What I have seen in the log is "loading batch of 1 event into staging bucket".
Sharing the screenshots as well: they show that every change is fetched from MySQL but data is not transferred to BigQuery, and that the table schema was created but no data arrives; the log keeps showing "loading batch of 1 event into staging bucket". We are using developer mode and have waited for more than 90 minutes.
The issue might be happening because there is a schema/data-type mismatch on some columns between the BigQuery table and the source MySQL table.
For example: a column may be of the INT64 data type with a length of 19 in BigQuery, while in the source database table it is an INTEGER with a length of 10, so you need to update the column lengths to match your data size.
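One way to spot such a mismatch is to compare the column metadata on both sides. The sketch below is only illustrative; the schema, dataset and table names (mydb, my_dataset, orders) are placeholders, not names from the question:

-- MySQL: list the declared type of every column in the source table
SELECT COLUMN_NAME, DATA_TYPE, COLUMN_TYPE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'mydb' AND TABLE_NAME = 'orders';

-- BigQuery: list the types the replication pipeline created for the target table
SELECT column_name, data_type
FROM my_dataset.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'orders';

Any column whose types do not line up between the two results is a candidate for the stalled load.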

Power BI report data storage concept

I am looking for an answer regarding the report data storage concept in Power BI.
I have published 3 reports to the Power BI service (cloud):
Report1 with an Excel source
Report2 with an on-premises SQL Server source
Report3 with an Azure SQL source
Around 200 users in my organization will be accessing these reports. I want to understand the following:
The first time a particular report is accessed, will the data be fetched from the source and shown in the report, or will it be stored in some cloud location from which the data then goes to the report?
Suppose a user opens a report that was already viewed by another user; will the data be fetched from the source again, or is there any concept of a cross-user shared cache?
Suppose a user opens the report for the 2nd time (for example, after having already accessed it, the user refreshes the web page); will the data be fetched again, or is there any concept of a shared cache?
Does the answer to any of the above change if I use Power BI Report Server (on-premises) and deploy the report to PBRS?
With the service, you typically upload a PBIX, which contains the report pages and all of the underlying data. Unless you set up a data gateway to accommodate DirectQueries and/or scheduled refreshes, the cloud service does not access your original data sources at all. With a scheduled refresh, it only accesses the original data during the refresh. A DirectQuery connection does access a server "live" but has many limitations.
The data is fetched when you load it into your Power BI desktop application and then loaded into the cloud when you publish the report to a workspace. Once it's there, the data shown to the user is fetched from the cloud copy, not the original data source.
Same answer as above regarding where the data is fetched from (the cloud copy). I don't believe there is shared cache between users but rather each user has some temporary caching individually. This type of caching saves the calculation results (computed on the underlying data) that are needed to populate the report visuals.
There is some caching done temporarily so that if a user switches among slicer combinations to one previously chosen you may see much quicker loading than when selecting a new configuration since it cached the results and doesn't need to recompute them. As far as I understand, this kind of caching is short-lived and not shared among users. Remember, this type of cache is not the same as the underlying data in the cloud copy of the PBIX.
I've not used an on-premises server, but I would expect the behavior to be similar, with the exception that the service runs on the local server instead of a cloud server somewhere else.
The upshot is that traffic in the service is separated from the requests to the original source data (assuming no DirectQuery connections). Those original sources are only accessed during data refreshes, which are independent of end-user actions (under the same assumption).

Optimize data load from Azure Cosmos DB to Power BI

Currently we have a problem loading data when refreshing the report against the DB, since it has too many records and it takes forever to load all the data. The issue is how I can load only the data from the last year to avoid taking so long to load everything. As I see, the Cosmos DB connector has a box that allows me to place a SQL query, but I don't know how to write one for this type of non-relational database.
Power BI has an incremental refresh feature. You should be able to refresh the current year only.
If that still doesn't meet expectations, I would look at a preview feature called Azure Synapse Link, which automatically pulls all Cosmos DB updates out into analytical storage that you can query much faster in Azure Synapse Analytics, in order to refresh Power BI faster.
Depending on the volume of the data, you will hit a number of issues. The first is that you may exceed your RU limit, slowing down the extraction of the data from Cosmos DB. The second issue will be transforming the data from JSON into a structured format.
I would try to write a query that specifies only the fields and items that you need. That will reduce the time it takes to process and fetch the data.
For SQL queries it will be something like:
SELECT * FROM c WHERE c.partitionEntity = 'guid'
For more information on the CosmosDB SQL API syntax please see here to get you started.
You can use the query window in Azure to run the SQL commands, or Azure Storage Explorer to test the query, then move it to Power BI.
What is highly recommended is to extract the data into a place where it can be transformed into a structured format like a table or CSV file.
For example, use Azure Databricks to extract the data, then turn the JSON into a table-formatted object.
You have the option of running Databricks notebook queries against Cosmos DB, or running Azure Databricks in its own instance. Another option would be to use the change feed and an Azure Function to send and shred the data to Blob Storage, and then query it from there using Power BI, Databricks, Azure SQL Database, etc.
In the Source of your Query, you can make a select based on the CosmosDB _ts system property, like:
Query ="SELECT * FROM XYZ AS t WHERE t._ts > 1609455599"
In this case, 1609455599 is the Unix timestamp (in seconds, which is what the _ts system property stores) corresponding to 31.12.2020, 23:59:59 in the UTC+1 time zone, so only data modified from 2021 onwards will be selected.
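If you prefer not to hard-code the epoch value, the Cosmos DB SQL API also has a built-in DateTimeToTimestamp function that returns milliseconds since the epoch (whereas _ts is in seconds), so the same cutoff can be sketched roughly like this; the projected field names are placeholders, not fields from the question:
Query ="SELECT c.id, c.orderDate, c.total FROM c WHERE c._ts >= DateTimeToTimestamp('2021-01-01T00:00:00.0000000Z') / 1000"
This keeps the filter readable and avoids converting dates to epoch seconds by hand whenever the cutoff changes.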

Azure SQL DWH: delete and restore it when required

Is there an option to restore a deleted database in SQL DWH at a later time (more than a year later)?
The documentation clearly indicates that when an Azure SQL Data Warehouse is dropped it keeps the final snapshot for seven days:
When you drop a data warehouse, SQL Data Warehouse creates a final snapshot and saves it for seven days. You can restore the data warehouse to the final restore point created at deletion.
The same article also mentions the fact you can vote for this feature here:
https://feedback.azure.com/forums/307516-sql-data-warehouse/suggestions/35114410-user-defined-retention-periods-for-restore-points
Even if you could do this, you are basically leaving it up to someone else to be in charge of your warehouse backups. What you could do instead is take control:
Store your Azure SQL Data Warehouse schema in source code control (e.g. git, Azure DevOps, formerly VSTS, etc.). If it isn't there already, you can reverse engineer the schema using SQL Server Management Studio (SSMS) versions 17.x onwards, or even use the SSDT preview feature.
Export your data to Data Lake or Azure Blob Storage using CREATE EXTERNAL TABLE AS SELECT (CETAS); see the sketch after this list. This will export your data as flat files to storage, where it won't be deleted. Alternatively, use Azure Data Factory to export the data and zip it up to save space.
When you need to recreate the warehouse, simply redeploy the schema from source code control and reload the data, e.g. via CTAS into staging tables, or use Azure Data Factory to re-import. If you saved your external table definitions in the schema you keep in source code control, they will just be there when you redeploy; INSERT back into the main tables from the external tables.
In this way you are in charge of your warehouse schema and your data to be recreated at any point you require, whether it be a day, a month or years.
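A rough sketch of that export/reload round trip follows. The table dbo.FactSale and the external data source and file format (MyAzureStorage, ParquetFileFormat) are illustrative names only and are assumed to have been created beforehand:

-- Export: write the warehouse table out to storage as flat files (CETAS)
CREATE EXTERNAL TABLE ext.FactSale
WITH (
    LOCATION = '/export/FactSale/',
    DATA_SOURCE = MyAzureStorage,      -- EXTERNAL DATA SOURCE assumed to exist
    FILE_FORMAT = ParquetFileFormat    -- EXTERNAL FILE FORMAT assumed to exist
)
AS
SELECT * FROM dbo.FactSale;

-- Reload: after redeploying the schema (including the external table definition)
-- from source control, insert back into the main table from the external table
INSERT INTO dbo.FactSale
SELECT * FROM ext.FactSale;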

DynamoDB local db limits - use for initial beta-go-live

Given Dynamo's pricing, the thought came to mind to use DynamoDB Local on an EC2 instance for the go-live of our startup SaaS solution. I've been trying to find something like a data sheet for the local DB specifying limits such as the number of tables or records, or the general size of the DB file. Possibly we could even run a few local DB instances on dedicated EC2 servers, as we know at login which user needs to be connected to which DB.
Does anybody have any information on the local DB limits or on this approach? Also, does anybody know of any legal/licensing issues with using dynamodb-local in that way?
Every item in DynamoDB Local will end up as a row in the SQLite database file. So the limits are based on SQLite's limitations.
Maximum number of rows in a table = 2^64, but the database file size limit will likely be reached first (140 terabytes).
Note: because of the above, the number of items you can store in DynamoDB Local will be smaller with the preview version of local with Streams support. This is because to support Streams the update records for items are also stored. E.g. if you are only doing inserts of these items then the item will effectively be stored twice: once in a table containing item data and once in a table containing the INSERT UpdateRecord data for that item (more records will also be generated if the item is being updated over time).
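If you want a rough idea of how close the underlying SQLite file is getting to that ceiling, you can open it with the sqlite3 shell and check its page statistics. The file name below assumes DynamoDB Local was started with the -sharedDb option (which writes a single shared-local-instance.db file); your file name may differ:

-- run inside: sqlite3 shared-local-instance.db
PRAGMA page_size;    -- bytes per page
PRAGMA page_count;   -- pages currently allocated
-- current file size in bytes = page_size * page_count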
Be aware that DynamoDB Local was not designed for the same performance, availability, and durability as the production service.