In this context, what is the difference between using this way of syncing and making a backup of the DB, syncing that backup file, and restoring the database from the file on the other side? Assume the backup would be small, for example no more than 10 MB.
When you restore from a backup, you simply overwrite the destination. You have no option to resolve conflicts and decide which row to keep (e.g., when the row on the source was updated and the row on the destination was updated as well).
I want to schedule a data transfer job from Cloud Storage to BigQuery.
I have an application that continuously dumps data to a GCS bucket path (let's say gs://test-bucket/data1/*.avro), and I want to move that data to BigQuery as soon as an object is created in GCS.
I don't want to migrate all the files in the folder again and again; I only want to move the objects added since the last run.
The BigQuery Data Transfer Service takes Avro files as input, but it does not take a folder, and it transfers all objects rather than only the newly added ones.
I am new to this, so I might be missing some functionality. How can I achieve it?
Please note: I want to schedule a job that loads data at a certain frequency (every 10 or 15 minutes). I don't want a trigger-based solution, since the number of objects generated will be huge.
You can use a Cloud Function with a Storage event trigger: just launch a Cloud Function that loads the data into BigQuery whenever a new file arrives.
https://cloud.google.com/functions/docs/calling/storage
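As a sketch only: inside such a function, the load itself could be issued as a single BigQuery SQL statement instead of a load-job API call (the dataset, table, and object names below are made up for illustration):

LOAD DATA INTO mydataset.machine_events
FROM FILES (
  format = 'AVRO',
  uris = ['gs://test-bucket/data1/newly-arrived-object.avro']
);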
EDIT: If you have more than 1,500 loads per day, you can work around the limit by loading with the BigQuery Storage API instead.
If you do not need top performance, you can also just create an external table over that folder and query it instead of loading every file.
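For example, a minimal sketch of such an external table in BigQuery SQL (the dataset and table name are made up; the wildcard URI comes from the question):

CREATE EXTERNAL TABLE mydataset.data1_external
OPTIONS (
  format = 'AVRO',
  uris = ['gs://test-bucket/data1/*.avro']
);

New Avro files dropped into the folder are then visible on the next query without any load job.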
I wanted to back up my Datomic DB. I am familiar with the steps described at https://docs.datomic.com/on-prem/backup.html, but since the data size is huge (in the TBs), I want to back up only a small preview of the entire DB, such as one or two attribute values from every entity.
Also, if I already had the full backup stored in some location like S3, would it be possible to copy only a certain part of it so that it becomes a small preview of the entire DB as described above?
What I do:
I built ETL processes with Power Query to load data (production machine stop history) from multiple Excel files directly into Power BI.
On each new shift (every 8 hours), the production machine generates a new Excel file that also needs to be loaded into the data model.
How I did it:
To do so, Power Query processes all files found in a specific folder.
The problem:
During a query refresh it needs to process all the data files again and again (old files + new files).
If I remove the old files from the folder, Power Query also removes their data from the data model during the next refresh cycle.
What I need / My question:
A batch process copies new files into the folder while removing all the old files.
Is there a way to configure Power Query so that it keeps the existing data inside the data model and just extends it with the data from the new files?
What I would like to avoid:
I know building a database would be one solution, but that requires a second system with a new ETL process, and Power Query already does a very good job of preprocessing the data! Therefore, if possible, I would highly appreciate a solution directly inside Power Query / Power BI.
If you want to shoot sparrows with a cannon, you could try incremental refresh, but it's a Premium feature.
In Power BI, refreshing a dataset reloads it: first the dataset is cleared, and then all the files are needed to reload it and recalculate everything. If you don't want this, you have to either change your ETL to store the data outside of the report's dataset (e.g., a database would be a very good choice), or push only the data from the new files to a dataset (which I wouldn't recommend in your case).
To summarize: the best solution is to build an ETL process that puts the data into a data warehouse, and then use that as the data source for your reports.
I have an order table in the OLTP system.
Each order record has an OrderStatus field.
When an end user creates an order, the OrderStatus field is set to "Open".
When somebody cancels the order, the OrderStatus field is set to "Canceled".
When the order process is finished (the order is transformed into an invoice), the OrderStatus field is set to "Close".
There are more than one hundred million records in the table in the OLTP system.
I want to design and populate a data warehouse and data marts on the HDFS layer.
In order to build the data marts, I need to import the whole order table to HDFS and then continuously reflect changes to the table.
First, I can import the whole table into HDFS in an initial load using Sqoop. It may take a long time, but I will only do this once.
When an order record is updated or a new order record is entered, I need to reflect the changes in HDFS. How can I achieve this for such a big transactional table?
Thanks
One of the easier ways is to work with database triggers in your OLTP source DB: every time an insert or update happens, use the trigger to push a change event towards your Hadoop environment.
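As an illustration only, here is a minimal sketch of such a trigger in SQL Server syntax, assuming a hypothetical dbo.Orders table and a change-log table that a scheduled job later ships to Hadoop (all names are made up):

CREATE TABLE dbo.OrderChanges (
    OrderId     BIGINT,
    OrderStatus VARCHAR(20),
    ChangedAt   DATETIME2 DEFAULT SYSUTCDATETIME()
);

CREATE TRIGGER trg_Orders_Change
ON dbo.Orders
AFTER INSERT, UPDATE
AS
BEGIN
    -- record which orders changed; an export job picks these rows up and pushes them to HDFS
    INSERT INTO dbo.OrderChanges (OrderId, OrderStatus)
    SELECT OrderId, OrderStatus FROM inserted;
END;

Keep in mind that a trigger like this adds write overhead to a table with a hundred million rows and a high transaction rate.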
On the other hand (depending on the requirements of your data users), it might be enough to reload the whole data dump every night.
Also, if there is some kind of last-changed timestamp, it might be possible to load only the newest data and do some kind of delta check.
This all depends on your data structure, your requirements, and the resources at hand.
There are several other ways to do this, but they usually involve messaging, development, and new servers, and I suppose that in your case this infrastructure or those resources are not available.
EDIT
Since you have a last-changed date, you might be able to pull the data with a statement like
SELECT columns FROM table WHERE lastchangedate >= (now - 24 hours)
or whatever your interval for loading might be.
Then process the data with Sqoop, ETL tools, or the like. If a record is already available in your Hadoop environment, you want to UPDATE it; if it is not, INSERT it with your appropriate mechanism. This is sometimes called UPSERTING.
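If the target on the Hadoop side is a transactional (ACID) Hive table, the upsert of such a delta could look roughly like this (the orders and orders_delta tables and their columns are hypothetical):

MERGE INTO orders AS t
USING orders_delta AS d
ON t.order_id = d.order_id
WHEN MATCHED THEN
  UPDATE SET order_status = d.order_status, last_changed = d.last_changed
WHEN NOT MATCHED THEN
  INSERT VALUES (d.order_id, d.order_status, d.last_changed);

Without ACID tables, the usual alternative is to rewrite the affected partitions by joining the existing data with the delta.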
I am using Microsoft Sync Framework 4.0 for syncing SQL Server database tables with an SQLite database on the iPad side.
Before making any database schema changes in the SQL Server database, we have to deprovision the database tables, and after making the schema changes we reprovision them.
In this process, the tracking tables (i.e., the syncing information) get deleted.
I want the tracking table information to be restored after reprovisioning.
How can this be done? Is it possible to make DB changes without deprovisioning?
E.g., the application is at version 2.0 and syncing is working fine. In the next version, 3.0, I want to make some DB changes. In the process of deprovisioning and reprovisioning, the tracking info gets deleted, so all the tracking information from the previous version is lost. I do not want to lose the tracking info. How can I restore the tracking information from the previous version?
I believe we will have to write custom code or a trigger to store the tracking information before deprovisioning. Could anyone suggest a suitable method or provide some useful links regarding this issue?
The provisioning process should automatically populate the tracking tables for you; you don't have to copy and reload them yourself.
Now, if you think the tracking table is where the framework stores what was previously synced, the answer is no.
The tracking table simply stores what was inserted/updated/deleted; it's used for change enumeration. The information on what was previously synced is stored in the scope_info table.
When you deprovision, you wipe out this sync metadata. When you then sync, it's as if the two replicas had never synced before, so you will encounter conflicts as the framework tries to apply rows that already exist on the destination.
You can find information here on how to "hack" the Sync Fx-created objects to effect some types of schema changes.
Modifying Sync Framework Scope Definition – Part 1 – Introduction
Modifying Sync Framework Scope Definition – Part 2 – Workarounds
Modifying Sync Framework Scope Definition – Part 3 – Workarounds – Adding/Removing Columns
Modifying Sync Framework Scope Definition – Part 4 – Workarounds – Adding a Table to an existing scope
Let's say I have one table "User" that I want to sync.
A tracking table "User_tracking" will be created, and some sync information will be present in it after syncing.
When I make any DB changes, the tracking table "User_tracking" will be deleted and the tracking info will be lost during the deprovisioning-provisioning process.
My workaround:
Before deprovisioning, I will write a script to copy all the "User_tracking" data into another temporary table, "User_tracking_1", so all the existing tracking info will be stored there. When I reprovision the table, a new tracking table "User_tracking" will be created.
After reprovisioning, I will copy the data from "User_tracking_1" back to "User_tracking" and then delete the contents of "User_tracking_1".
The "User_tracking" info will be restored.
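In T-SQL, a minimal sketch of that script (table names come from the example above; it assumes the reprovisioned "User_tracking" has the same columns and will accept the old rows unchanged):

-- before deprovisioning: keep a copy of the tracking data
SELECT * INTO User_tracking_1 FROM User_tracking;

-- ... deprovision, apply schema changes, reprovision ...

-- after reprovisioning: put the old tracking rows back and clear the copy
INSERT INTO User_tracking SELECT * FROM User_tracking_1;
DELETE FROM User_tracking_1;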
Is this the right approach...