There is 10 TB of data in a Snowflake database in the AWS US region. The requirement is to move the subset of data that has a certain flag in a column to the AWS Australia region.
After split, the US data will be around 6TB and Australia around 4TB.
There are 10 applications containing this mix of data.
I could think of 3 options to do this split.
1. Replicate the entire database from A (US) to B (Australia). Then pause the applications before breaking the replication. In B, delete the rows that belong to A. In A, delete the rows that belong to B. Clone the application set and configure the new set to read from and write to B.
2. Use CTAS in B with data from A
3. Use SSIS to push data from A to B. For this option, the application need not be stopped.
Please advise on these options, and on any other options by which this data split can be achieved.
Regards,
Mani
How these 10 applications access your Snowflake tables is unclear, but that information matters for choosing a solution.
Your best option for syncing data across two Snowflake accounts is database replication and failover:
https://docs.snowflake.net/manuals/user-guide/database-replication-failover.html
To split data based on a field can easily be done with materialized views that have a where clause containing this field. https://docs.snowflake.net/manuals/user-guide/views-materialized.html
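As a rough illustration of splitting on a flag column, the sketch below uses Python's built-in sqlite3 purely as a stand-in for Snowflake so that it runs anywhere. The table `orders`, the column `region_flag`, and the helper `split_by_flag` are hypothetical names, not anything from the question. It copies the table structure and then inserts only the rows that match the flag, which is the same filtering idea as a CTAS or a view with a WHERE clause.

```python
import sqlite3


def split_by_flag(conn, table, flag_column, flag_value, target):
    """Create `target` holding only the rows of `table` whose flag matches.

    Identifiers are interpolated directly, so they must come from trusted
    code and never from user input.
    """
    # Copy the column layout without copying any rows.
    conn.execute(f"CREATE TABLE {target} AS SELECT * FROM {table} WHERE 0")
    # Copy only the rows that carry the requested flag value.
    conn.execute(
        f"INSERT INTO {target} SELECT * FROM {table} WHERE {flag_column} = ?",
        (flag_value,),
    )
    return conn.execute(f"SELECT COUNT(*) FROM {target}").fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region_flag TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "US"), (2, "AU"), (3, "AU"), (4, "US"), (5, "US")],
)

au_rows = split_by_flag(conn, "orders", "region_flag", "AU", "orders_au")
us_rows = split_by_flag(conn, "orders", "region_flag", "US", "orders_us")
total_rows = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

Comparing the two row counts against the source total, as the last lines do, is a cheap sanity check before deleting anything on either side.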
My company wants to build a data warehouse in Redshift. We have an OLTP database running in Amazon Aurora and we are thinking of using DMS (Database Migration Service). I am trying to get my head around the capabilities of CDC (change data capture). CDC over DMS replicates and stores changes (in our case in Redshift), and I was wondering whether it is possible to select the specific columns I want to store (this should be possible with table mapping - include) and also the columns whose changes should trigger a new row. As far as I understand it, if any column of a row is updated, the replication is triggered, which could mean a useless replication (e.g. if somebody updates a column that I do not want to follow).
E.g. I have a table with leads, which has some 30 columns. For DW purposes I am interested in only 5 columns, and I want a new row in the Redshift table only if any of those 5 columns is updated. If the stage of a lead changes, I want a new row. On the other hand, I am not interested in the column 'Salesmans_comment', so if the salesman updates a comment, I do not want a new row.
I have gone through most of the available YouTube tutorials and read the documentation, but I haven't found a clear answer.
Thanks
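To make the requirement concrete, here is a minimal sketch of the rule being asked for, written as plain Python. It is not a DMS feature or API; it only illustrates the filter that would have to be applied somewhere (in the replication tool or downstream). The column names in `TRACKED_COLUMNS` and both helper functions are hypothetical.

```python
# Hypothetical set of the 5 columns that matter for the warehouse.
TRACKED_COLUMNS = ("stage", "owner", "value", "status", "source")


def tracked_change(before, after, tracked=TRACKED_COLUMNS):
    """Return True when at least one tracked column differs between row images."""
    return any(before.get(col) != after.get(col) for col in tracked)


def filter_events(events, tracked=TRACKED_COLUMNS):
    """Keep only (before, after) update events that touch a tracked column."""
    return [(b, a) for b, a in events if tracked_change(b, a, tracked)]


lead_v1 = {"stage": "new", "owner": "ann", "Salesmans_comment": ""}
lead_v2 = {"stage": "new", "owner": "ann", "Salesmans_comment": "called twice"}
lead_v3 = {"stage": "qualified", "owner": "ann", "Salesmans_comment": "called twice"}

# The comment-only update is dropped; the stage change is kept.
kept = filter_events([(lead_v1, lead_v2), (lead_v2, lead_v3)])
```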
I'm trying to establish if my planned way of working is correct.
I have two data sources; a MySql & MSSQL database. I need to combine these data sources and expose this data for Power BI to consume.
I've decided to use Azure Synapse Analytics for the ETL and would like to understand if there is anything in the process I can simplify or do better.
The process is as follows:
MySQL and MSSQL deltas are loaded into ASA in Parquet format and stored in Azure Gen 2 storage.
Once the copy pipeline is complete, a subsequent data flow unions the data from the two sources and inserts it into MSSQL storage in ASA.
Power BI consumes from this workspace / data source.
I'm not sure whether I should be storing the data from the sources in Azure Gen 2, or whether I should just perform the transform and insert from the sources straight into the MSSQL storage. Any thoughts or suggestions would be greatly appreciated.
The pattern that you're following is the data lake pattern, where data is moved between 3 zones:
Raw
Enriched
Curated
The Raw zone keeps an original copy of the data before transformation. The benefit of storing the data this way (as parquet files, here) is so that you can troubleshoot a problem with the transformation or create a different transformation to address a new need.
The Enriched zone is where you have done some transformation, like UNIONing your data, or providing some other clean up steps, maybe removing unneeded columns, correcting addresses, etc. You have done this by inserting the data into a SQL database, but this might also be accomplished by using views in the serverless pool, if the transformations are simple enough: https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/create-use-views
The Curated zone is a place to transform your data into a form that BI applications will do well with, i.e. a star schema. Even if this is a very simple dataset, it will be well worth incorporating a date dimension, which will yield a lot of benefits in Power BI. The bottom line here is that Power BI is optimized to work with star schemas, so that's what you should give it.
You do not need to use data lake technologies to follow this pattern and still get the benefits. Whether what you are doing is good depends on how everything performs versus how simple you can keep it.
Here's more on the topic: https://learn.microsoft.com/en-us/azure/cloud-adoption-framework/scenarios/cloud-scale-analytics/best-practices/data-lake-overview
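A minimal sketch of the three zones in plain Python, assuming each raw extract is a list of dicts with hypothetical `order_date` and `amount` fields. The function names are made up for illustration and nothing here is Synapse-specific. It unions the two sources (Enriched), then builds a date dimension and a fact list keyed to it (Curated).

```python
from datetime import date, timedelta


def enrich(mysql_rows, mssql_rows):
    """Enriched zone: union both raw extracts and tag each row with its source."""
    return ([{**row, "source": "mysql"} for row in mysql_rows]
            + [{**row, "source": "mssql"} for row in mssql_rows])


def date_key(day):
    """Integer surrogate key in yyyymmdd form."""
    return day.year * 10000 + day.month * 100 + day.day


def build_date_dimension(start, end):
    """Curated zone: one row per calendar day from start to end inclusive."""
    days = (start + timedelta(days=i) for i in range((end - start).days + 1))
    return [
        {"date_key": date_key(d), "date": d, "year": d.year,
         "month": d.month, "iso_weekday": d.isoweekday()}
        for d in days
    ]


def build_fact(enriched_rows):
    """Curated zone: fact rows reference the date dimension by key."""
    return [
        {"date_key": date_key(r["order_date"]), "amount": r["amount"],
         "source": r["source"]}
        for r in enriched_rows
    ]


raw_mysql = [{"order_date": date(2024, 1, 1), "amount": 10}]
raw_mssql = [{"order_date": date(2024, 1, 3), "amount": 5}]

enriched = enrich(raw_mysql, raw_mssql)
dim_date = build_date_dimension(date(2024, 1, 1), date(2024, 1, 3))
fact = build_fact(enriched)
```

The raw lists are left untouched, which mirrors the point of the Raw zone: the original data stays available if a transformation needs to be debugged or replaced.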
Once copy pipeline is complete a subsequent data flow unions the data
from the two sources and inserts into MSSQL storage in ASA
What is the MSSQL storage used for? If it is only used by Power BI to create reports, you can use ADLS Gen2 instead, as it will be cheaper (this is very much in line with what Mark described above as "curated").
One more thing to consider: Power BI can read data from both sources and do the transformation itself.
I have the BigQuery Data Transfer Service for Campaign Manager set up in dataset A in GCP project A. I would like to move this to dataset B located in project B. How can I move the existing data and set up the BigQuery transfer without any loss of data or duplicates?
I'm afraid you would have to:
Copy the relevant tables from dataset A to dataset B
Set up the transfer service again for dataset B (assuming it can be done if the tables already exist in the target dataset)
De-dup the data yourself.
A workaround that achieves something similar, but not exactly what you asked, is to create views in dataset B over the relevant tables in dataset A.
By doing so, these views will behave like proper tables in dataset B and you don't have to worry about de-duplication, data loss, or setting up the data transfer again. The downside is that you will have to keep dataset A around.
Here's how I migrated the transfer service:
The transfer service was enabled in the project B.
Once the data started to arrive in dataset B, the historical
data (from the start until MIN(partition_date)-1) was copied from
dataset A to the appropriate partitions in dataset B.
The transfer service in project A was stopped after verifying the
partition counts and row counts.
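The backfill range in the second step can be sketched as follows. This only computes the dates and prints illustrative `bq cp` command strings using partition decorators (`table$YYYYMMDD`); it does not call BigQuery. The project, dataset, and table names are hypothetical, and the exact `bq cp` flags (for example whether to append or overwrite) should be checked against the bq documentation before running anything.

```python
from datetime import date, timedelta


def backfill_dates(history_start, first_new_partition):
    """Dates from history_start up to the day before the first partition
    that the new transfer loaded into dataset B (MIN(partition_date) - 1)."""
    days = (first_new_partition - history_start).days
    return [history_start + timedelta(days=i) for i in range(days)]


def copy_commands(table, dates,
                  src="project-a:dataset_a", dst="project-b:dataset_b"):
    """Build one illustrative partition-copy command per date."""
    return [
        f"bq cp '{src}.{table}${d:%Y%m%d}' '{dst}.{table}${d:%Y%m%d}'"
        for d in dates
    ]


# Hypothetical values: history begins 2020-01-01, dataset B starts 2020-01-04.
dates = backfill_dates(date(2020, 1, 1), date(2020, 1, 4))
commands = copy_commands("my_table", dates)
for command in commands:
    print(command)
```

Because the range stops one day before the first partition in dataset B, no date is copied twice, which is what avoids duplicates in this approach.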
The AWS Redshift team recommends using TRUNCATE to clean up a large table.
I have a continuous EC2 service that keeps adding rows to a table. I would like to apply some purging mechanism, so that when the cluster is near full it will auto delete old rows (say using the index column).
Is there some best practice for doing that?
Do I need to write my own code to handle that? (if so is there already a Python script for that that I can use e.g. in a Lambda function?)
A common practice when dealing with continuous data is to create a separate table for each month, eg Sales-2018-01, Sales-2018-02.
Then create a VIEW that combines the tables:
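-- Table names containing hyphens must be double-quoted.
-- UNION ALL keeps every row and avoids the de-duplication pass that UNION performs.
CREATE VIEW sales AS
SELECT * FROM "Sales-2018-01"
UNION ALL
SELECT * FROM "Sales-2018-02";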
Then, create a new table each month and remove the oldest month from the View. This effectively gives a 12-month rolling view of the data.
The benefit is that data does not have to be deleted from tables (which would then require a VACUUM). Instead, the old table can simply be dropped, or kept around for historical reporting with a different View.
See: Using Time Series Tables - Amazon Redshift
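The monthly rotation could be scripted along these lines. This is a sketch that only generates the SQL text for a rolling view; it does not connect to Redshift. The `Sales-YYYY-MM` naming follows the example above, while the function names and the 12-month default are assumptions. It uses UNION ALL, which keeps all rows without the de-duplication pass that UNION performs, and double-quotes the table names because they contain hyphens.

```python
from datetime import date


def rolling_months(latest, count=12):
    """Return (year, month) pairs for the `count` months ending at `latest`,
    oldest first."""
    months = []
    year, month = latest.year, latest.month
    for _ in range(count):
        months.append((year, month))
        month -= 1
        if month == 0:
            year, month = year - 1, 12
    return list(reversed(months))


def rolling_view_sql(view, prefix, latest, count=12):
    """Build a CREATE OR REPLACE VIEW statement over the monthly tables."""
    selects = [
        f'SELECT * FROM "{prefix}-{year}-{month:02d}"'
        for year, month in rolling_months(latest, count)
    ]
    return f"CREATE OR REPLACE VIEW {view} AS\n" + "\nUNION ALL\n".join(selects) + ";"


sql = rolling_view_sql("sales", "Sales", date(2018, 2, 1), count=2)
print(sql)
```

A scheduled job could run the generated statement once a month after creating the new table, then drop the table that fell out of the window.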
I am rather new to AWS Data Pipeline, so any help will be appreciated. I have used the pipeline template RDStoS3CopyActivity to extract all contents of a table in RDS MySQL. It seems to be working well. But there are 90 other tables to be extracted and dumped to S3, and I cannot imagine creating 90 pipelines, one for each table.
What is the best approach to resolving this task? How could the pipeline be instructed to iterate through a list of table names?
I am not sure if this will ever get a response. However, in this early stage of exploration, I have developed a pipeline that seems to fit a preliminary purpose -- extracting from 10 RDS MySQL tables and copying each to its respective sub-bucket on S3.
The logic is rather simple.
Configure connection for the RDS MySQL.
Extract data by specifying in "Select Query" field for each table.
Drop in a Copy Activity and link it up for each table above. It runs on a specified EC2 instance. If you're running an expensive query, make sure you choose an appropriate EC2 instance with enough CPU and memory. This step copies the extracted dump, which lives temporarily in the EC2 tmp filesystem, to a designated S3 bucket that you will set up next.
Finally, the designated / target destination.
By default, data extracted and loaded to S3 bucket will be comma separated. If you need it to be tab delimited, then in the last target S3 destination:
- Add an optional field.. > select Data Format.
- Create a new Tab Separated. This will appear under the category of 'Others'.
- Give it a name. I call it Tab Separated.
- Type: TSV. Hover mouse over 'Type' to learn more of other data formats.
- Column separator: \t (I could leave this blank as the type was already specified as TSV)
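Rather than clicking through the same steps for each table, the per-table objects could be generated from a list. The sketch below builds a pipeline-definition-style JSON document in Python. The object types and field names (`SqlDataNode`, `S3DataNode`, `CopyActivity`, `selectQuery`, `directoryPath`, `runsOn`) follow the pipeline definition syntax as I understand it and should be verified against the Data Pipeline documentation. The ids `rds_mysql` and `ec2_instance` are assumed to refer to database and resource objects defined elsewhere, and the bucket and table names are hypothetical.

```python
import json


def table_objects(table, bucket):
    """Source node, destination node, and copy activity for one table."""
    return [
        {"id": f"Src_{table}", "name": f"Src_{table}", "type": "SqlDataNode",
         "table": table, "selectQuery": f"select * from {table}",
         "database": {"ref": "rds_mysql"}},        # assumed to exist elsewhere
        {"id": f"Dst_{table}", "name": f"Dst_{table}", "type": "S3DataNode",
         "directoryPath": f"s3://{bucket}/{table}/"},
        {"id": f"Copy_{table}", "name": f"Copy_{table}", "type": "CopyActivity",
         "input": {"ref": f"Src_{table}"}, "output": {"ref": f"Dst_{table}"},
         "runsOn": {"ref": "ec2_instance"}},       # assumed to exist elsewhere
    ]


def pipeline_definition(tables, bucket):
    """Collect the three objects for every table into one definition."""
    objects = []
    for table in tables:
        objects.extend(table_objects(table, bucket))
    return {"objects": objects}


definition = pipeline_definition(["customers", "orders", "invoices"], "my-export-bucket")
definition_json = json.dumps(definition, indent=2)
```

The resulting JSON would still need the shared objects (schedule, database connection, EC2 resource) added before it could be used as a complete definition.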
If the tables are all in the same RDS instance, why not use a SqlActivity pipeline with a SQL statement containing multiple unload commands to S3?
You can then write just one query and use one pipeline.