Load a new file every day from an S3 bucket into a Snowflake table

My Amazon S3 path is as follows:
s3://dev-mx-allocation-storage/ph_test_late_waiver/{year}/{month}/{day}/{flow_number}*.csv
I need to create a pipeline from S3 to Snowflake where, for each day of the month, a new CSV file lands in the bucket and that CSV file is inserted into a Snowflake table.
I am very new to this; could I please get a command in Snowflake that can do this?

Snowpipe lends itself well to near-real-time data requirements, as it loads data based on triggers and can manage vast, continuous loading. The data volumes and the compute/storage resources needed to load the data are managed by the Snowflake cloud, which is why it is promoted as a serverless feature. If it's one less thing to manage, all the better to focus our energies on our own application development!
Step-by-step guide: https://medium.com/@walton.cho/auto-ingest-snowpipe-on-s3-85a798725a69
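As a rough sketch of what the Snowflake side could look like (the stage, pipe, table, and storage integration names below are placeholders, and a storage integration for the bucket is assumed to already exist), you can run the DDL through the Python connector or a worksheet:

```python
# Minimal sketch of the Snowflake objects Snowpipe needs for auto-ingest.
# Account credentials, stage/pipe/table names, and the storage integration
# are placeholders; adjust to your environment.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    warehouse="my_wh",
    database="my_db",
    schema="public",
)
cur = conn.cursor()

# External stage pointing at the S3 prefix the daily CSVs land in
cur.execute("""
    CREATE STAGE IF NOT EXISTS late_waiver_stage
      URL = 's3://dev-mx-allocation-storage/ph_test_late_waiver/'
      STORAGE_INTEGRATION = my_s3_integration  -- assumes the integration already exists
""")

# Pipe that copies every new file under the stage into the target table
cur.execute("""
    CREATE PIPE IF NOT EXISTS late_waiver_pipe AUTO_INGEST = TRUE AS
      COPY INTO late_waiver_table
      FROM @late_waiver_stage
      FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
""")

# SHOW PIPES returns the pipe's notification_channel (an SQS ARN); point the
# bucket's s3:ObjectCreated:* event notifications at it so each new CSV
# triggers the load automatically.
cur.execute("SHOW PIPES")
print(cur.fetchall())
conn.close()
```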

Related

AWS ETL job to write data to an S3 bucket in CSV format

I am completely new to AWS and need your support to point me in the right direction for my requirement.
Requirement:
I need to read multiple CSV files from an S3 bucket, union the data, perform some transformations, and load the result into another S3 bucket.
Issue:
I understand that Lambda is one of the options to do this, but the data is huge, so I believe the 15-minute limitation will become an issue at some point.
Also, for Glue ETL, from what I read I understand it does support S3 as an output.
Ask:
Could you suggest any other ETL services, and a link to help me get started?
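For what it's worth, the read-union-transform-write flow described above could be sketched as a Glue Spark job along these lines (the bucket paths and the amount filter are made-up placeholders):

```python
# Sketch of the flow described above as an AWS Glue Spark job: read every CSV
# under a source prefix, union them, apply a transformation, and write the
# result to another bucket. Paths, the header option, and the filter column
# are hypothetical.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Reading a glob of CSVs already unions them into one DataFrame
df = spark.read.option("header", "true").csv("s3://source-bucket/input/*.csv")

# Placeholder transformation: keep rows with a positive amount
transformed = df.filter(F.col("amount").cast("double") > 0)

# Write the combined, transformed data to the destination bucket as CSV
transformed.write.mode("overwrite").option("header", "true").csv(
    "s3://destination-bucket/output/"
)

job.commit()
```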

How to perform Backfilling in redshift to bigquery migration?

I am using the BigQuery Data Transfer Service to migrate all data from Redshift to BigQuery.
After that, I want to perform backfilling for a specific time window if any data is missing, but I don't see any backfill option in the transfer job.
How can I achieve that in BigQuery?
Reading your question in the light of your comments, I would proceed differently from what you describe. You reach the same goal, however. :)
Using your ETL pipeline, the first step would be to accumulate raw data in a data lake.
Let's take a storage service like S3 to do so. For this ETL pipeline, S3 is your data sink.
Note that this pipeline does nothing more than take raw data from the source and put it into S3. The location in S3 should also be under a folder timestamped by day (e.g. yyyymmdd) so that you can sort and consume your data along the time dimension.
The data in question is, of course, newer than what you already have in Redshift.
It may also have a different structure from what you already put in Redshift, due to any transformations you set in your initial pipeline.
If you loaded raw data directly into Redshift, then just export that data into the same S3 bucket under the prefix legacy/*. (If it was transformed, then you have to add a second S3 data sink to your pipeline after this intermediate transformation and keep the same S3 naming strategy.)
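As a tiny illustration of that naming strategy (the bucket, prefix, and payload below are placeholders), the load step of the pipeline could do something like:

```python
# Tiny sketch of the per-day layout: the pipeline's load step writes raw data
# under a yyyymmdd prefix. Bucket name, prefix, and payload are placeholders.
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
day = datetime.now(timezone.utc).strftime("%Y%m%d")  # e.g. 20200914

raw_csv_bytes = b"id,amount\n1,10\n"  # stands in for the extracted raw data

# Each day's raw data lands in its own folder so a single day can be replayed
s3.put_object(
    Bucket="my-datalake-bucket",
    Key=f"raw/{day}/extract.csv",
    Body=raw_csv_bytes,
)
```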
Let's take a break to understand what we have. We have filled an S3 bucket with raw data that we can now replay at will for a specific day, using cron or an orchestration tool such as Apache Airflow. Moreover, you can freely modify the content of each timestamped folder if you missed data and then replay the downstream pipelines: that is the backfill you want.
Speaking of which, S3 would act as the data source for these downstream pipelines, which apply the desired transformations to the raw data from S3 and use BigQuery (and potentially Redshift) as data sinks. Now please take the cost of these operations into consideration: the streaming API in BigQuery is expensive, as high as $0.50 per GB, so only do that if you need real-time results.
If you can afford a latency of more than 5 minutes, a better strategy is to make GCS the data sink of your ETL and transfer the data from there into BigQuery (keeping the same yyyymmdd file-naming pattern to enable backfills). This transfer is free if the GCS bucket and the BigQuery dataset are in the same region. You can trigger the transfer with GCS events, for instance a Cloud Function fired on blob creation that loads the data into BigQuery.
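A minimal sketch of that trigger, assuming a 1st-gen background Cloud Function on the object-finalize event and placeholder project/dataset/table names:

```python
# Sketch of the GCS-event trigger: a background Cloud Function fired on object
# finalize that loads the new blob into BigQuery. Project, dataset, and table
# names are placeholders.
from google.cloud import bigquery

bq_client = bigquery.Client()

def gcs_to_bq(event, context):
    """Triggered when a new object is created in the GCS bucket."""
    uri = f"gs://{event['bucket']}/{event['name']}"
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    load_job = bq_client.load_table_from_uri(
        uri, "my_project.my_dataset.my_table", job_config=job_config
    )
    load_job.result()  # wait so any load error surfaces in the function logs
```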
Last but not least, backfilling should be done wisely, especially in BigQuery, where row-level updates and inserts are not performant and are an open door to duplication. What you should consider instead is BigQuery partitioning, which you can set on a column that contains a timestamp, or on a hidden ingestion-time column if your data contains none. Which timestamp? The one encoded in the GCS folder name!
Once again, you can modify the data in your GCS bucket for a given day and replay the transfer into BigQuery.
But each transfer for a given day must overwrite the partition that data belongs to (e.g. the data under 20200914 overwrites the associated partition in BigQuery). Doing so, we abide by the concept of a pure task, which guarantees idempotency and prevents duplication.
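A minimal sketch of such a partition overwrite, assuming a day-partitioned target table, that the client accepts the partition decorator in the destination table ID, and placeholder bucket/project/dataset/table names:

```python
# Sketch of overwriting a single day's partition: load the files under the
# yyyymmdd folder into the matching partition with WRITE_TRUNCATE. Bucket,
# project, dataset, and table names are placeholders; the target table is
# assumed to be partitioned by day.
from google.cloud import bigquery

client = bigquery.Client()
day = "20200914"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,  # replaces only this partition
)

# The "$yyyymmdd" decorator targets the single partition the data belongs to
load_job = client.load_table_from_uri(
    f"gs://my-datalake-bucket/{day}/*.csv",
    f"my_project.my_dataset.my_table${day}",
    job_config=job_config,
)
load_job.result()
```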
Please read this article to have more insights.
Note: if you intend to get rid of Redshift, you can choose to do so directly and forget about S3 as the data sink of your first ETL. Choose GCS directly (ingress is free) and migrate your existing Redshift data into GCS, using S3 as an intermediate service and the Google transfer service from S3 to GCS.

How to run delete and insert query on S3 data on AWS

So I have some historical data on S3 in .csv/.parquet format. Every day my batch job runs and gives me two files: the list of records that need to be deleted from the historical snapshot, and the new records that need to be inserted into the historical snapshot. I cannot run insert/delete queries on Athena. What options (cost-effective and managed by AWS) do I have to solve my problem?
Objects in Amazon S3 are immutable. This means that they can be replaced, but they cannot be edited.
Amazon Athena, Amazon Redshift Spectrum and Hive/Hadoop can query data stored in Amazon S3. They typically look in a supplied path and load all files under that path, including sub-directories.
To add data to such data stores, simply upload an additional object in the given path.
To delete all data in one object, delete the object.
However, if you wish to delete data within an object, then you will need to replace the object with a new object that has those rows removed. This must be done outside of S3. Amazon S3 cannot edit the contents of an object.
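For example, a small script outside of S3 could download the object, drop the unwanted rows, and upload the result over the same key (the bucket, key, and "id" column below are placeholders):

```python
# Sketch of "replace the object with a new object that has those rows removed":
# read the CSV from S3, drop the unwanted rows locally, and upload the result
# over the original key. Bucket, key, and the "id" column are placeholders.
import csv
import io

import boto3

s3 = boto3.client("s3")
bucket, key = "my-historical-bucket", "snapshot/part-0000.csv"
ids_to_delete = {"42", "1001"}  # e.g. taken from the daily "to delete" file

body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
reader = csv.DictReader(io.StringIO(body))
kept = [row for row in reader if row["id"] not in ids_to_delete]

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
writer.writeheader()
writer.writerows(kept)

# Overwrite the original object; S3 itself never edits content in place
s3.put_object(Bucket=bucket, Key=key, Body=out.getvalue().encode("utf-8"))
```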
See: AWS Glue adds new transforms (Purge, Transition and Merge) for Apache Spark applications to work with datasets in Amazon S3
Databricks has a product called Delta Lake that can add an additional layer between query tools and Amazon S3:
Delta Lake is an open source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark APIs.
Delta Lake supports deleting data from a table because it sits "in front of" Amazon S3.
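A rough sketch of the daily delete + insert against a Delta table on S3 (the paths, the "id" join key, and the Spark/Delta session setup are assumptions):

```python
# Rough sketch of the daily delete + insert against a Delta table stored on S3.
# Paths, the "id" join key, and the Spark/Delta session setup are assumptions.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

snapshot_path = "s3://my-historical-bucket/snapshot/"
snapshot = DeltaTable.forPath(spark, snapshot_path)

# Delete the keys listed in today's "to delete" file
to_delete = spark.read.option("header", "true").csv("s3://my-batch-bucket/today/deletes.csv")
(
    snapshot.alias("s")
    .merge(to_delete.alias("d"), "s.id = d.id")
    .whenMatchedDelete()
    .execute()
)

# Append today's new records to the historical snapshot
new_rows = spark.read.option("header", "true").csv("s3://my-batch-bucket/today/inserts.csv")
new_rows.write.format("delta").mode("append").save(snapshot_path)
```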

AWS Glue ETL: transfer data to an S3 bucket

I wish to transfer data in a database like MySQL[RDS] to S3 using AWS Glue ETL.
I am having difficulty trying to do this; the documentation is really not good.
I found this link here on stackoverflow:
Could we use AWS Glue just copy a file from one S3 folder to another S3 folder?
So based on this link, it seems that Glue does not support an S3 bucket as a data destination; it may only have it as a data source.
So, I hope I am wrong about this.
But if one makes an ETL tool on AWS, one of the first basics is for it to transfer data to and from an S3 bucket, the major form of storage on AWS.
So I hope someone can help with this.
You can add a Glue connection to your RDS instance and then use the Spark ETL script to write the data to S3.
You'll have to first crawl the database table using Glue Crawler. This will create a table in the Data Catalog which can be used in the job to transfer the data to S3. If you do not wish to perform any transformation, you may directly use the UI steps for autogenerated ETL scripts.
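A minimal sketch of such a job script, assuming placeholder Data Catalog database/table names and a placeholder destination bucket:

```python
# Minimal sketch of the Glue Spark ETL step: read the crawled RDS table from the
# Data Catalog and write it to S3 as CSV. Catalog database/table names and the
# destination bucket are placeholders.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Source: the table the crawler registered in the Data Catalog
source = glue_context.create_dynamic_frame.from_catalog(
    database="my_rds_catalog_db", table_name="my_schema_my_table"
)

# Destination: CSV files under an S3 prefix
glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://my-destination-bucket/rds-export/"},
    format="csv",
)

job.commit()
```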
I have also written a blog on how to Migrate Relational Databases to Amazon S3 using AWS Glue. Let me know if it addresses your query.
https://ujjwalbhardwaj.me/post/migrate-relational-databases-to-amazon-s3-using-aws-glue
Have you tried https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-template-copyrdstos3.html?
You can use AWS Data Pipeline - it has standard templates for full as well as incremental copies from RDS to S3.

How to update BigQuery back-end data on each upload to a bucket

I have created a BigQuery table out of the data I have in my Cloud Storage bucket.
In my use case, I am sending data periodically to the same bucket that backs my BigQuery table (while creating the BigQuery table I used the same bucket name).
Is it possible to get the updated data into BigQuery, given that I am pushing new data into the same bucket at some interval?
Just to mention - I am creating a native BigQuery table from the dedicated storage bucket mentioned above.
Your help will be much appreciated. Thanks in advance.
You can create an external (federated) table on the Google Cloud Storage bucket. In this case, whenever you query this table you will get the latest data.
If you just need to append data to a table (let's call it the target table) based on data from the bucket, I can imagine the following process:
Create a federated table on the GCS bucket
Set up a simple cron job that runs a bq command which simply does select * from [federated_table] and appends the results to the target table (you may have a more complicated query that checks for duplicates in the target table and only appends new rows); a sketch of this step follows below.
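A minimal sketch of that step using the BigQuery Python client instead of the bq CLI (project, dataset, and table names are placeholders):

```python
# Sketch of the append step: query the external (federated) table and append
# the results to the target table. Project, dataset, and table names are
# placeholders.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.QueryJobConfig(
    destination=bigquery.TableReference.from_string("my_project.my_dataset.target_table"),
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

# The external table reads straight from the GCS bucket at query time
query = "SELECT * FROM `my_project.my_dataset.federated_table`"
client.query(query, job_config=job_config).result()
```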
Alternative option:
Set up a trigger on your bucket that activates a Cloud Function, and in the Cloud Function you just load the newly added data into the target table.