Is there an option to restore a deleted database in SQL DWH at a later time (more than a year)?
The documentation clearly indicates that when an Azure SQL Data Warehouse is dropped, it keeps the final snapshot for seven days:
When you drop a data warehouse, SQL Data Warehouse creates a final snapshot and saves it for seven days. You can restore the data warehouse to the final restore point created at deletion.
The same article also mentions that you can vote for this feature here:
https://feedback.azure.com/forums/307516-sql-data-warehouse/suggestions/35114410-user-defined-retention-periods-for-restore-points
Even if you could do this, you are basically leaving it up to someone else to be in charge of your warehouse backups. What you could do instead is take control:
Store your Azure SQL Data Warehouse schema in source code control (e.g. git, Azure DevOps (formerly VSTS), etc.). If it isn't there already, you can reverse-engineer the schema using SQL Server Management Studio (SSMS) version 17.x onwards, or even use the SSDT preview feature.
Export your data to Data Lake or Azure Blob Storage using CREATE EXTERNAL TABLE AS SELECT (CETAS). This will export your data as flat files to storage, where it won't be deleted (see the sketch after this list). Alternatively, use Azure Data Factory to export the data and zip it up to save space.
When you need to recreate the warehouse, simply redeploy the schema from source code control and redeploy the data, e.g. via CTAS into staging tables, or use Azure Data Factory to re-import. If you saved your external tables in the schema you keep in source code control, they will just be there when you redeploy. INSERT back into the main tables from the external tables.
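A minimal sketch of the export and re-import, assuming placeholder names throughout (MyDataSource, MyFileFormat, dbo.FactSales, SaleKey) and that the external data source and file format already exist:
-- Export to flat files in storage (CETAS)
CREATE EXTERNAL TABLE dbo.FactSales_ext
WITH (
    LOCATION = '/backup/FactSales/',
    DATA_SOURCE = MyDataSource,
    FILE_FORMAT = MyFileFormat
)
AS SELECT * FROM dbo.FactSales;

-- On the rebuilt warehouse: re-create the internal table from the external files (CTAS),
-- or CTAS into a staging table and INSERT into the main table as described above
CREATE TABLE dbo.FactSales
WITH (
    DISTRIBUTION = HASH(SaleKey),
    CLUSTERED COLUMNSTORE INDEX
)
AS SELECT * FROM dbo.FactSales_ext;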
In this way you are in charge of your warehouse schema and your data, and can recreate them at any point you require, whether it be a day, a month, or years later.
A simple diagram of the proposed design:
Hi, we want to replicate data from MySQL (source) to Google BigQuery (destination). We adopted the method described in the Google docs, using a Data Fusion replication pipeline, as mentioned in this link:
https://cloud.google.com/data-fusion/docs/tutorials/replicating-data/mysql-to-bigquery
In brief, what we are doing:
Enabling binlog in MySQL for CDC (Change Data Capture)
Creating a replication pipeline in Data Fusion
Starting the pipeline and syncing the data
We were successfully able to create the MySQL database in Compute Engine, enable binlog for CDC, and grant the user all necessary permissions for the data replication pipeline in MySQL.
We were able to create a Data Fusion instance and a replication pipeline.
The replication pipeline is able to fetch our SQL database details, and the target BigQuery dataset is also set.
On starting the pipeline, it tracks the changes successfully (insert, update, and delete), and the table schema is also created automatically in BigQuery.
But the PROBLEM we are getting is that no data is transferred to the BigQuery table. What I have seen in the log is a message about loading a batch of 1 event into the staging bucket.
Sharing the screenshots also:
Able to fetch every change from MySQL, but data is not transferring to BigQuery.
The table schema was created, but data is not transferred.
Loading a batch of 1 event into the staging bucket; we are using developer mode and have waited for more than 90 minutes.
The issue might be happening because there is a schema/data type mismatch between the BigQuery table and the source MySQL table on some columns.
For example, a column may be of INT64 data type with a length of 19 in BigQuery, while in the source database table it is an INTEGER type with a length of 10; in that case you need to update the length of the columns to match your data size.
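A quick way to check for such a mismatch is to list the column types on both sides and compare them. A minimal sketch, where my_project, my_dataset and my_table are placeholder names:
-- BigQuery: list column names and data types of the replicated table
SELECT column_name, data_type
FROM `my_project.my_dataset.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'my_table'
ORDER BY ordinal_position;

-- MySQL, for comparison:
SHOW COLUMNS FROM my_table;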
Recently unlinked and re-linked a Firebase project with a different Google Analytics account.
The BigQuery integration configured to export GA data created the new dataset and data started populating into that.
The old dataset, corresponding to the unlinked "default" GA account and containing ~2 years of data, is still accessible in the BigQuery UI; however, only the 5 most recent event_ tables are visible in the dataset (5 days' worth of event data).
Is it possible to extract historical data from the old, unlinked dataset?
What I would suggest is to run some queries to further validate the data that you have within your BigQuery dataset.
In this case, I would start by getting the dates for each table to see how many days of data the dataset contains.
SELECT event_date
FROM `firebase-public-project.analytics_153293282.events_*`
GROUP BY event_date ORDER BY event_date
EDIT
A better way to do this, and to get all the tables within the dataset, is to use the bq command-line tool; see the reference here.
bq ls firebase-public-project:analytics_153293282
You'll get something like this:
You could also do a COUNT(event_date), so you can see how many records you have per day, and compare this with what you can see in your Firebase project.
SELECT event_date, COUNT(event_date) AS event_count
FROM `firebase-public-project.analytics_153293282.events_*` GROUP BY event_date ORDER BY event_date
In case there's data missing, you could use table decorators to try to recover that data; see the example here.
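Table decorators are a legacy SQL feature; a rough standard-SQL equivalent is time travel with FOR SYSTEM_TIME AS OF. A minimal sketch (the table suffix is illustrative, and this only works within BigQuery's time-travel window):
-- Read the table as it existed 24 hours ago
SELECT *
FROM `firebase-public-project.analytics_153293282.events_20201231`
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)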
Regarding the table expiration date: in short, an expiration time can be set by default at the dataset level, in which case it is applied to new tables (existing tables require a manual update of their expiration time one by one), and an expiration time can also be set when a table is created. To see whether the expiration time was changed, you could look in your logs for protoPayload.methodName="tableservice.update" and check whether an expireTime was set, as follows:
tableUpdateRequest: {
resource: {
expireTime: "2020-12-31T00:00:00Z"
...
}
}
Besides this, if you have a GCP support plan, you could reach out to them for further assistance on what could have happened to the tables in that dataset. Otherwise, you could open an issue in the issue tracker. Keep in mind that Firebase doesn't delete your data when you unlink a Firebase project from BigQuery, so in theory the data should still be there.
Currently we have a problem with loading data when refreshing the report data from the DB, since it has too many records and it takes forever to load everything. The question is how I can load only the data from the last year, to avoid taking so long to load everything. As far as I can see, connecting to Cosmos DB in the dialog box allows me to enter an SQL query, but I don't know how to write one for this type of non-relational database.
Example
Power BI has an incremental refresh feature. You should be able to refresh the current year only.
If that still doesn't meet expectations, I would look at a preview feature called Azure Synapse Link, which automatically pulls all Cosmos DB updates out into analytical storage that you can query much faster in Azure Synapse Analytics, in order to refresh Power BI faster.
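For context, once Synapse Link is enabled, the analytical store can be queried from a Synapse serverless SQL pool with OPENROWSET. A minimal sketch, where the account, database, key and container names are all placeholders:
-- Synapse serverless SQL: read the Cosmos DB analytical store
SELECT TOP 10 *
FROM OPENROWSET(
    'CosmosDB',
    'Account=myCosmosAccount;Database=myDatabase;Key=<account-key>',
    myContainer
) AS documents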
Depending on the volume of the data, you will hit a number of issues. The first is that you may exceed your RU limit, slowing down the extraction of the data from Cosmos DB. The second is transforming the data from JSON into a structured format.
I would try to write a query that specifies only the fields and items that you need. That will reduce the time spent processing and retrieving the data.
For SQL queries, it will be something like:
SELECT * FROM c WHERE c.partitionEntity = 'guid'
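Following the advice above about limiting the query to the fields you need, a sketch with illustrative field names:
SELECT c.id, c.orderDate, c.total FROM c WHERE c.partitionEntity = 'guid'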
For more information on the CosmosDB SQL API syntax please see here to get you started.
You can use the query window in Azure to run the SQL commands, or Azure Storage Explorer to test the query, then move it to Power BI.
What is highly recommended is to extract the data into a place where it can be transformed into a structured format like a table or CSV file.
For example, use Azure Databricks to extract the data, then turn the JSON into a table-formatted object.
You have the option of running Databricks notebook queries against Cosmos DB, or using Azure Databricks in its own instance. Another option would be to use the change feed and an Azure Function to send and shred the data into Blob Storage, and query it from there using Power BI, Databricks, Azure SQL Database, etc.
In the Source step of your query, you can do a SELECT based on the Cosmos DB _ts system property, like:
Query ="SELECT * FROM XYZ AS t WHERE t._ts > 1609455599"
In this case, 1609455599 is the Unix timestamp that corresponds to 31.12.2020, 23:59:59, so only data from 2021 will be selected.
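As a sketch, if a rolling last-365-days window is acceptable, the cutoff can also be computed inside the query instead of being hard-coded, since GetCurrentTimestamp() returns milliseconds while _ts is in seconds:
Query ="SELECT * FROM XYZ AS t WHERE t._ts > (GetCurrentTimestamp() / 1000) - (365 * 24 * 60 * 60)"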
I'm currently using Azure Data Factory to move over data from an Azure SQL database to an Azure DW instance.
This works fine for one table, but I have a lot of tables I'd like to move over. Using Azure Data Factory, it looks like I need to create a set of Source/Sink datasets and pipelines for every table in the database.
Is there a way to move multiple tables across without having to set up each table in the manner described above?
The copy operation allows you to select multiple tables to move in a single pipeline. From the Azure SQL Data Warehouse portal, you can follow this process to set up a multi-table pipeline:
Click on the Load Data button
Select Azure Data Factory
Create a new data factory or use an existing one, ensuring that the Load Data option is selected
Select the Run once now option
Choose your Azure SQL Database source and enter the credentials
On the Select Tables screen, select multiple tables
Continue the Pipeline, save and execute
What variables determine the length of time it takes to restore an Azure SQL Data Warehouse database?
I am creating a new ADW database on a new logical server (in the same Azure region) using the Azure portal and the source specified as the backup for another ADW database (that has ~100TB of uncompressed data loaded to compressed columnar tables).
The Azure portal reports the status as "Deploying", but I am unclear whether this will be an operation taking minutes, hours, or days.
Is there any way to track the progress?
Restore time for Azure SQL Data Warehouse is primarily determined by two factors: the size of the database you are restoring, and the location you are restoring it to. The fastest restore is into the paired region, the second fastest is into the original region, and finally everywhere else.
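On tracking progress: as a sketch, and assuming the behaviour matches Azure SQL Database, you can query the sys.dm_operation_status DMV in the master database of the target logical server to see the state and percent complete of long-running operations such as a restore:
-- Run against the master database of the target logical server
SELECT operation, state_desc, percent_complete, start_time, last_modify_time
FROM sys.dm_operation_status
ORDER BY start_time DESC;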