How to load data from AWS S3 to Snowflake Internal Stage - database-migration

We are trying to take our data from an AWS S3 bucket (external stage) and load it into a Snowflake internal stage. Snowflake should act as our data lake, which would reduce the amount of storage we use on AWS. Is there any built-in functionality that can transfer data from external stage --> internal stage?
The goal is to load the data into the internal Snowflake stage and subsequently delete the data from AWS. We want Snowflake to be the data lake.

What do you mean by internal stage?
If you are planning to load into Snowflake tables, your scenario is a perfect use case for Snowpipe; for more info, see Automating Snowpipe for Amazon S3

An internal stage would just be a different S3 bucket utilized by Snowflake. So it's not really "reducing" the amount of storage, just changing its location. If you still wanted to do this, you could pull the files down from your external stage's bucket (GET only works against internal stages, so the download would go through the S3 tooling) and PUT them into the internal stage. Or you could just load from the external stage into your tables in Snowflake via any of the available methods.
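If the table-load route fits, the COPY is a one-liner; here is a minimal sketch assuming a hypothetical external stage my_s3_stage, target table raw_events, and file format my_csv_format (none of these names come from the question):

    -- Hypothetical names: my_s3_stage (external stage over the S3 bucket),
    -- raw_events (target table), my_csv_format (a file format you define).
    CREATE OR REPLACE FILE FORMAT my_csv_format
      TYPE = CSV
      FIELD_OPTIONALLY_ENCLOSED_BY = '"'
      SKIP_HEADER = 1;

    -- Load straight from the external stage into a Snowflake table.
    COPY INTO raw_events
      FROM @my_s3_stage
      FILE_FORMAT = (FORMAT_NAME = my_csv_format)
      PATTERN = '.*[.]csv';

    -- If you do want the raw files in an internal stage instead, download them
    -- from the S3 bucket (e.g. with the AWS CLI) and upload them via SnowSQL:
    -- PUT file:///tmp/extract/*.csv @my_internal_stage AUTO_COMPRESS = TRUE;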

You've got to stop thinking that a "data lake" means a bunch of raw data files stored in a cloud bucket somewhere. That's the 2010 version of a data lake.
In Snowflake, you can load the raw data into tables that mirror those files (either structured column-by-column, or semi-structured JSON, XML, Parquet, ...). Think of these tables as your "raw" zone. With Streams and Tasks, you can automate the curation of the data in the raw zone into a second set of tables - the "curated" zone. Another set of Streams/Tasks might go another step and pre-aggregate the curated data into an "aggregated" zone. The design of the workflows is up to you.
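As a rough illustration of the raw-to-curated hop, here is a minimal stream-plus-task sketch; the table, stream, task and warehouse names (raw_orders, curated_orders, transform_wh, ...) are all hypothetical:

    -- A stream records the rows that land in the raw-zone table.
    CREATE OR REPLACE STREAM raw_orders_stream ON TABLE raw_orders;

    -- A task periodically moves the newly arrived rows into the curated zone,
    -- applying whatever cleanup or typing you need along the way.
    CREATE OR REPLACE TASK curate_orders_task
      WAREHOUSE = transform_wh
      SCHEDULE  = '15 MINUTE'
    WHEN SYSTEM$STREAM_HAS_DATA('RAW_ORDERS_STREAM')
    AS
      INSERT INTO curated_orders (order_id, order_ts, amount)
      SELECT order_id, order_ts, amount
      FROM raw_orders_stream
      WHERE METADATA$ACTION = 'INSERT';

    -- Tasks are created in a suspended state; resume to start the schedule.
    ALTER TASK curate_orders_task RESUME;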
The cloud storage just becomes a "landing area" for raw extracted data, and can be deleted after ingestion into Snowflake. You now have a single platform for your raw data, curated data, and aggregated data. Hook up a data governance tool like Alation or Collibra to maintain the lineage of the data through its journey.
-Paul-

Related

Data Copy from Azure Blob to S3 and Synapse to Redshift

There is a requirement to copy 10 TB of data from Azure Blob to S3, and also 10 TB of data from Synapse to Redshift.
What is the best way to achieve these 2 migrations?
For the Redshift side: you could export the Azure Synapse Analytics data to blob storage in a compatible format, ideally compressed, and then copy the data to S3. It is pretty straightforward to import data from S3 into Redshift.
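The import side of that is a plain Redshift COPY from S3; a rough sketch, assuming a hypothetical target table, bucket prefix and IAM role, and gzip-compressed CSV as the export format:

    -- Hypothetical names throughout: analytics.sales_history, my-migration-bucket,
    -- and the IAM role ARN. Swap CSV/GZIP for FORMAT AS PARQUET if you exported Parquet.
    COPY analytics.sales_history
    FROM 's3://my-migration-bucket/synapse-export/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    CSV
    GZIP
    IGNOREHEADER 1
    REGION 'us-east-1';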
You may need a VM instance to read from Azure Storage and put the data into AWS S3 (where it runs doesn't matter much). The simplest option seems to be using the default CLIs (Azure and AWS) to read the content onto the migration instance and write it to the target bucket. Personally, though, I'd consider writing a small application that records checkpoints, so that if the migration is interrupted for any reason it doesn't have to start again from scratch.
There are a few options you may "tweak" based on the files to move: whether there are many small files or fewer large files, which region you are moving from and to, and so on.
https://aws.amazon.com/premiumsupport/knowledge-center/s3-upload-large-files/
You may also consider AWS S3 Transfer Acceleration, which may or may not help.
Please note that every major cloud provider charges for outbound data egress; for 10 TB this can be a considerable cost.

Load a new file every day from S3 bucket to Snowflake table

My Amazon S3 path is as follows:
s3://dev-mx-allocation-storage/ph_test_late_waiver/{year}/{month}/{day}/{flow_number}*.csv
I need to create a pipeline from S3 to Snowflake where, for each day of the month, a new CSV file lands in the bucket and that CSV file should be inserted into a Snowflake table.
I am very new to this; is there a Snowflake command that can do this?
Snowpipe lends itself well to real-time requirements of data, as it loads data based on triggers and can manage vast and continuous loading. Data volumes and the compute/storage resources to load data are managed by the Snowflake cloud, which is why it is promoted as a serverless feature. If it’s one less thing to manage, all the better to focus our energies on our own application development!
Step by step guide: https://medium.com/@walton.cho/auto-ingest-snowpipe-on-s3-85a798725a69
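To make that concrete, a rough sketch of the objects involved, using the bucket path from the question but hypothetical stage, table and pipe names (and placeholder credentials; a storage integration is the usual production choice):

    -- External stage over the bucket from the question (credentials are placeholders).
    CREATE OR REPLACE STAGE late_waiver_stage
      URL = 's3://dev-mx-allocation-storage/ph_test_late_waiver/'
      CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...');

    -- Target table; the columns here are illustrative and must match the CSV layout.
    CREATE OR REPLACE TABLE late_waiver_raw (
      col1 STRING,
      col2 STRING
    );

    -- AUTO_INGEST = TRUE makes the pipe load each new file as S3 event notifications
    -- arrive (the bucket must be configured to publish events to the pipe's SQS queue).
    CREATE OR REPLACE PIPE late_waiver_pipe
      AUTO_INGEST = TRUE
    AS
      COPY INTO late_waiver_raw
      FROM @late_waiver_stage
      FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
      PATTERN = '.*[.]csv';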

How to perform backfilling in a Redshift to BigQuery migration?

I am using the BigQuery Data Transfer Service to migrate all data from Redshift to BigQuery.
After that, I want to perform backfilling for a specific time range if any data is missing, but I don't see any backfill option in the transfer job.
How can I achieve that in BigQuery?
Reading your question in the light of your comments, I would proceed differently from what you describe. You reach the same goal, however :).
Using your ETL pipeline, the first step would be to accumulate raw data in a data lake.
Let's take a storage service like S3 to do so. For this ETL pipeline, S3 is your data sink.
Note that your pipeline does nothing more than take raw data from A and put it into S3. Also, the location in S3 should be under a timestamped folder per day (e.g. yyyymmdd), so that you can sort and consume your data along the time dimension.
Obviously the data considered here is ahead in time of the data you already have in Redshift.
It may also have a different structure from what you already put in Redshift, due to potential transformations you applied in your initial pipeline.
In case you loaded raw data directly into Redshift, then just export that data into the same S3 bucket under the name legacy/*. (If it was transformed, then you have to put a second S3 data sink in your pipeline for this intermediary transformation and keep the same S3 naming strategy.)
Let's take a break to understand what we have. We have filled an S3 bucket with raw data that we can now replay at will for a specific day, using cron or an orchestration tool such as Apache Airflow. Moreover, you can freely modify the content of each timestamped folder in case you missed data, and replay the following pipelines => the backfill you want.
Speaking of which, S3 would act as the data source for these following pipelines, which would apply the wanted transformations to the raw data from S3 and use BigQuery (and potentially Redshift) as the data sink.
Now please take into consideration the price of these operations. The streaming API in BQ is expensive, as high as $0.50 per GB, so use it only if you need real-time results. If you can afford a latency of more than 5 minutes, a better strategy would be to set GCS as the data sink of your ETL and transfer the data from there into BQ (remember to keep the same yyyymmdd file-naming pattern to enable potential backfills). This transfer is free if the GCS bucket and the BQ dataset are in the same region. You could trigger the transfer with GCS events, for instance a Cloud Function fired on blob creation that puts the data into BQ.
Last but not least, backfilling should be done wisely, especially in BQ, where row-level updates or inserts are not performant and are an open door to duplication. What you should consider is a BigQuery partition, set on a column that contains a timestamp (or a hidden one if your data contains none). Which timestamp? The one encoded in the GCS folder name!
Once again you can modify data in your GCS bucket per day and replay the transfer into BQ.
But each transfer for a given day must overwrite the partition the considered data belongs to (e.g. the data under 20200914 would overwrite the associated partition in BQ). Doing so, we abide by the concept of a pure task, which is a guarantee of idempotency and non-duplication.
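One way to sketch that per-day, idempotent replay in BigQuery SQL, with hypothetical dataset and table names (analytics.events as the partitioned target, analytics.events_staging loaded beforehand from the matching gs://.../20200914/ folder); here the partition "overwrite" is implemented as delete-then-insert on the day's partition:

    -- Date-partitioned target table (hypothetical schema).
    CREATE TABLE IF NOT EXISTS analytics.events (
      event_id   STRING,
      event_ts   TIMESTAMP,
      event_date DATE,
      payload    STRING
    )
    PARTITION BY event_date;

    -- Replaying 2020-09-14: wipe that day's partition, then re-insert it.
    -- Running this twice for the same day gives the same result (idempotent).
    DELETE FROM analytics.events
    WHERE event_date = DATE '2020-09-14';

    INSERT INTO analytics.events (event_id, event_ts, event_date, payload)
    SELECT event_id, event_ts, DATE '2020-09-14', payload
    FROM analytics.events_staging;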
Please read this article to have more insights.
Note: If you intend to get rid of Redshift, you can choose to do so directly and forget about S3 as the data sink of your first ETL. Choose GCS directly (ingress is free) and migrate your already-present Redshift data into GCS, using S3 as an intermediary service and the Google transfer service from S3 to GCS.

How to run delete and insert query on S3 data on AWS

So I have some historical data on S3 in .csv/.parquet format. Every day I have a batch job running which gives me 2 files: the list of records that need to be deleted from the historical snapshot, and the new records that need to be inserted into the historical snapshot. I cannot run insert/delete queries on Athena. What options (cost-effective and managed by AWS) do I have to solve this problem?
Objects in Amazon S3 are immutable. This means that they can be replaced, but they cannot be edited.
Amazon Athena, Amazon Redshift Spectrum and Hive/Hadoop can query data stored in Amazon S3. They typically look in a supplied path and load all files under that path, including sub-directories.
To add data to such data stores, simply upload an additional object in the given path.
To delete all data in one object, delete the object.
However, if you wish to delete data within an object, then you will need to replace the object with a new object that has those rows removed. This must be done outside of S3. Amazon S3 cannot edit the contents of an object.
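One way to produce that replacement is an Athena CTAS that writes a filtered copy of the snapshot to a new S3 location; a rough sketch with hypothetical table names (historical_snapshot, daily_deletes) and bucket path:

    -- Write a new snapshot that excludes the records listed in the deletes file.
    -- Table names and the S3 location are hypothetical.
    CREATE TABLE cleaned_snapshot
    WITH (
      format = 'PARQUET',
      external_location = 's3://my-bucket/snapshot_v2/'
    ) AS
    SELECT h.*
    FROM historical_snapshot h
    LEFT JOIN daily_deletes d
      ON h.record_id = d.record_id
    WHERE d.record_id IS NULL;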
See: AWS Glue adds new transforms (Purge, Transition and Merge) for Apache Spark applications to work with datasets in Amazon S3
Databricks has a product called Delta Lake that can add an additional layer between query tools and Amazon S3:
Delta Lake is an open source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark APIs.
Delta Lake supports deleting data from a table because it sits "in front of" Amazon S3.
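With Delta Lake, the daily delete and insert files can then be applied with plain Spark SQL against the table that fronts the S3 path; a rough sketch with hypothetical table names (history as the Delta table, daily_deletes and daily_inserts registered from the two batch-job files):

    -- Remove the rows listed in today's delete file.
    MERGE INTO history AS h
    USING daily_deletes AS d
      ON h.record_id = d.record_id
    WHEN MATCHED THEN DELETE;

    -- Append the new records from today's insert file.
    INSERT INTO history
    SELECT * FROM daily_inserts;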

Does Amazon Redshift have its own storage backend

I'm new to Redshift and would like some clarification on how Redshift operates:
Does Amazon Redshift have its own backend storage platform, or does it depend on S3 to store the data as objects, with Redshift used only for querying, processing and transforming, using temporary storage to pick up the specific slice from S3 and process it?
In other words, does Redshift have its own backend cloud storage, in the way that Oracle or Microsoft SQL Server have their own physical servers on which data is stored?
Because if I'm migrating from a conventional RDBMS to Redshift due to increased volume, would Redshift alone do, or should I opt for a combination of Redshift and S3?
This question seems to be basic, but I'm unable to find answer in Amazon websites or any of the blogs related to Redshift.
Yes, Amazon Redshift uses its own storage.
The prime use-case for Amazon Redshift is running complex queries against huge quantities of data. This is the purpose of a "data warehouse".
Whereas normal databases start to lose performance when there are 1+ million rows, Amazon Redshift can handle billions of rows. This is because data is distributed across multiple nodes and is stored in a columnar format, making it suitable for handling "wide" tables (which are typical in data warehouses). It is this dedicated storage, and the way the data is stored, that gives Redshift its speed.
The trade-off, however, is that while Redshift is amazing for querying large quantities of data, it is not designed for frequently updating data. Thus, it should not be substituted for a normal database that is being used by an application for transactions. Rather, Redshift is often used to take that transactional data, combine it with other information (customers, orders, transactions, support tickets, sensor data, website clicks, tracking information, etc.) and then run complex queries that combine all that data.
Amazon Redshift can also use Amazon Redshift Spectrum, which is very similar to Amazon Athena. Both services can read data directly from Amazon S3. Such access is not as efficient as using data stored directly in Redshift, but can be improved by using columnar storage formats (eg ORC and Parquet) and by partitioning files. This, of course, is only good for querying data, not for performing transactions (updates) against the data.
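A rough sketch of how that S3 access is wired up in Spectrum, with hypothetical names for the external schema, Glue catalog database, IAM role and bucket:

    -- All names below are hypothetical.
    CREATE EXTERNAL SCHEMA spectrum
    FROM DATA CATALOG
    DATABASE 'spectrum_db'
    IAM_ROLE 'arn:aws:iam::123456789012:role/SpectrumRole'
    CREATE EXTERNAL DATABASE IF NOT EXISTS;

    -- Columnar files (Parquet) and partitioning keep the amount of S3 data scanned down.
    CREATE EXTERNAL TABLE spectrum.clicks (
      user_id    BIGINT,
      page       VARCHAR(256),
      clicked_at TIMESTAMP
    )
    PARTITIONED BY (click_date DATE)
    STORED AS PARQUET
    LOCATION 's3://my-archive-bucket/clicks/';
    -- Each day's partition is then registered with ALTER TABLE ... ADD PARTITION.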
The newer Amazon Redshift RA3 nodes also have the ability to offload less-used data to Amazon S3, and uses caching to run fast queries. The benefit is that it separates storage from compute.
Quick summary:
If you need a database for an application, use Amazon RDS
If you are building a data warehouse, use Amazon Redshift
If you have a lot of historical data that is rarely queried, store it in Amazon S3 and query it via Amazon Athena or Amazon Redshift Spectrum
Looking at your question, you may benefit from professional help with your architecture.
However, to get you started, Redshift:
has its own data storage, with no link to S3.
allows you, via Amazon Redshift Spectrum, to also query data held in S3 (similar to AWS Athena).
is not a good alternative as a back-end database to replace a traditional RDBMS, as transactions are very slow.
is a great data warehouse tool, just use it for that!