I am using the BigQuery Data Transfer Service to migrate all data from Redshift to BigQuery.
After that, I want to perform backfilling for a specific time range, in case any data is missing. But I don't see any backfill option in the transfer job.
How can I achieve that in BigQuery?
Reading your question in light of your comments, I would proceed differently from what you describe. You reach the same goal, however :) .
Using your ETL pipeline, the first step would be to accumulate raw data in a datalake.
Let's take a storage service like S3 to do so. For this ETL pipeline, S3 is your datasink.
Note that your pipeline does nothing more than take raw data from A and put it into S3. Also, the location in S3 should be under a folder timestamped per day (e.g. yyyymmdd) so that you can sort and consume your data along the time dimension.
Obviously the data considered here is ahead in time of the data you already have in Redshift.
It may also have a different structure from the one you already put in Redshift, due to any transformation you set in your initial pipeline.
In case you loaded raw data directly into Redshift, just export that data into the same S3 bucket under the name legacy/*. (In case it is transformed, then you have to put a second S3 datasink in your pipeline with this intermediary transformation and keep the same S3 naming strategy.)
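As a rough illustration of that naming strategy (bucket name, prefix, and record shape are hypothetical), the first pipeline could sink raw batches like this:

```python
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
BUCKET = "my-datalake-bucket"  # hypothetical bucket name


def sink_raw_batch(records, source="crm"):
    # Key the raw batch by extraction day (yyyymmdd) so each day's folder
    # can be inspected, corrected, and replayed independently later on.
    now = datetime.now(timezone.utc)
    key = f"raw/{source}/{now:%Y%m%d}/batch-{now:%H%M%S}.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(records).encode("utf-8"))
    return key
```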
Let's take a break to understand what we have. We filled an S3 bucket with raw data that we can now replay at will for a specific day, using a cron or an orchestration tool such as Apache Airflow. Moreover, you can freely modify the content of each timestamped folder in case you missed data, and then replay the following pipelines => the backfill you want.
Speaking of which, S3 would act as the data source for these following pipelines, which apply the wanted transformations to the raw data from S3 and choose BigQuery (and potentially Redshift) as the datasink.

Now please take the price of these operations into consideration. The streaming API in BQ is expensive, on the order of $0.05 per GB ingested; use it only if you need real-time results. If you can afford a latency of more than 5 minutes, a better strategy would be to set GCS as the datasink of your ETL and transfer the data from there into BQ (keep the same yyyymmdd file naming pattern to enable potential backfill). This transfer is free if the GCS bucket and the BQ dataset are in the same region. You could trigger the transfer with GCS events, for instance a Cloud Function fired on blob creation that loads the data into BQ.
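A minimal sketch of such a trigger, assuming a Python Cloud Function bound to the bucket's object-finalize event (project, dataset, and table names are hypothetical):

```python
from google.cloud import bigquery

bq = bigquery.Client()
TABLE_ID = "my-project.my_dataset.events"  # hypothetical destination table


def load_blob_to_bq(event, context):
    # Triggered on object finalize in the GCS bucket; loads the new file
    # into BigQuery with a free batch load job (no streaming cost).
    uri = f"gs://{event['bucket']}/{event['name']}"  # e.g. gs://bucket/raw/20200914/part-000.csv
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    bq.load_table_from_uri(uri, TABLE_ID, job_config=job_config).result()
```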
Last but not least, backfilling should be done wisely, especially in BQ where updates or creations at row level are not performant and are an open door to duplication. What you should consider is BigQuery partitioning, which you can set on a column that contains a timestamp, or on a hidden one if your data contains none. Which timestamp? Well, the one set in the GCS folder name!
Once again you can modify data in your GCS bucket per day and replay the transfer into BQ.
But each transfer for a given day must overwrite the partition the considered data belongs to (e.g. the data under 20200914 would overwrite the associated partition in BQ). Doing so, we abide by the concept of a pure task, which is a guarantee of idempotency and non-duplication.
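For the replay itself, a sketch under the same hypothetical names: the `$yyyymmdd` partition decorator plus `WRITE_TRUNCATE` replaces only that day's partition, which keeps the load idempotent.

```python
from google.cloud import bigquery

bq = bigquery.Client()


def replay_day(day: str, bucket: str = "my-datalake-bucket"):  # hypothetical bucket
    # Load everything under the day's folder into the matching partition,
    # overwriting whatever that partition previously contained.
    destination = f"my-project.my_dataset.events${day}"  # partition decorator
    uri = f"gs://{bucket}/raw/{day}/*.csv"
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    bq.load_table_from_uri(uri, destination, job_config=job_config).result()


replay_day("20200914")
```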
Please read this article for more insights.
Note: If you intend to get rid of Redshift, you can choose to do it directly and forget about S3 as the datasink of your first ETL. Choose GCS directly (ingress is free) and migrate your existing Redshift data into GCS, using S3 as an intermediary service and the Google transfer service from S3 to GCS.
Related
My Amazon S3 path is as follows:
s3://dev-mx-allocation-storage/ph_test_late_waiver/{year}/{month}/{day}/{flow_number}*.csv
I need to create a pipeline from S3 to Snowflake where, for each day of the month, a new CSV file lands in the bucket and that CSV file should be inserted into a Snowflake table.
I am very new to this; can I please get a command in Snowflake which can do that?
Snowpipe lends itself well to real-time data requirements, as it loads data based on triggers and can manage vast and continuous loading. Data volumes and the compute/storage resources to load data are managed by the Snowflake cloud, which is why it is promoted as a serverless feature. If it's one less thing to manage, all the better to focus our energies on our own application development!
Step-by-step guide: https://medium.com/@walton.cho/auto-ingest-snowpipe-on-s3-85a798725a69
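As a hedged sketch of what that looks like in practice (stage, pipe, table, and storage integration names are hypothetical; the S3 path comes from the question), you can create the stage and an auto-ingest pipe from a Python session:

```python
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",  # hypothetical credentials
    warehouse="my_wh", database="my_db", schema="public",
)
cur = conn.cursor()

# External stage over the S3 prefix described in the question.
cur.execute("""
    CREATE STAGE IF NOT EXISTS late_waiver_stage
      URL = 's3://dev-mx-allocation-storage/ph_test_late_waiver/'
      STORAGE_INTEGRATION = my_s3_integration
      FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
""")

# Auto-ingest pipe: S3 event notifications (wired to the pipe's SQS queue)
# trigger a COPY of each new CSV as it lands under the daily folders.
cur.execute("""
    CREATE PIPE IF NOT EXISTS late_waiver_pipe AUTO_INGEST = TRUE AS
      COPY INTO my_db.public.late_waiver_table
      FROM @late_waiver_stage
      PATTERN = '.*[.]csv'
""")
```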
So I have some historical data on S3 in .csv/.parquet format. Every day I have a batch job running which gives me 2 files: the list of records that need to be deleted from the historical snapshot, and the new records that need to be inserted into the historical snapshot. I cannot run insert/delete queries on Athena. What options (cost-effective and managed by AWS) do I have to solve this problem?
Objects in Amazon S3 are immutable. This means that they can be replaced, but they cannot be edited.
Amazon Athena, Amazon Redshift Spectrum and Hive/Hadoop can query data stored in Amazon S3. They typically look in a supplied path and load all files under that path, including sub-directories.
To add data to such data stores, simply upload an additional object in the given path.
To delete all data in one object, delete the object.
However, if you wish to delete data within an object, then you will need to replace the object with a new object that has those rows removed. This must be done outside of S3. Amazon S3 cannot edit the contents of an object.
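A minimal sketch of that replace-the-object approach, assuming pandas with s3fs/pyarrow available, hypothetical paths, and an `id` key column:

```python
import pandas as pd

SNAPSHOT = "s3://my-bucket/snapshot/part-0001.parquet"  # hypothetical object

# Read the object, drop the rows flagged for deletion, then overwrite the
# same key; S3 itself never edits an object in place.
df = pd.read_parquet(SNAPSHOT)
deletes = pd.read_csv("s3://my-bucket/daily/deletes.csv")  # hypothetical delete list
df[~df["id"].isin(deletes["id"])].to_parquet(SNAPSHOT, index=False)
```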
See: AWS Glue adds new transforms (Purge, Transition and Merge) for Apache Spark applications to work with datasets in Amazon S3
Databricks has a product called Delta Lake that can add an additional layer between query tools and Amazon S3:
Delta Lake is an open source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark APIs.
Delta Lake supports deleting data from a table because it sits "in front of" Amazon S3.
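A sketch of that daily workflow with Delta Lake (table path, file locations, and the `id` join key are hypothetical; assumes a Spark session with the Delta Lake package configured):

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("daily-snapshot-maintenance")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

snapshot = DeltaTable.forPath(spark, "s3://my-bucket/historical_snapshot/")  # hypothetical path

# Delete the records listed in today's "to delete" file.
to_delete = spark.read.parquet("s3://my-bucket/daily/deletes/")  # hypothetical path
(
    snapshot.alias("t")
    .merge(to_delete.alias("d"), "t.id = d.id")  # assumes a primary key column `id`
    .whenMatchedDelete()
    .execute()
)

# Append today's new records to the snapshot.
new_rows = spark.read.parquet("s3://my-bucket/daily/inserts/")  # hypothetical path
new_rows.write.format("delta").mode("append").save("s3://my-bucket/historical_snapshot/")
```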
I'm working on a task of copying CSV files from an S3 bucket to Redshift. I've found multiple ways to do so, but I'm not sure which one would be best. Here's the scenario:
At regular intervals, multiple CSV files of around 500 MB - 1 GB each will be added to my S3 bucket. The data can contain duplicates. The task is to copy the data to a Redshift table while ensuring that duplicate data is not present in Redshift.
Here are the ways I found which can be used:
Create an AWS Lambda function which will be triggered whenever a file is added to the S3 bucket.
Use AWS Kinesis
Use AWS Glue
I understand Lambda should not be used for jobs that take more than 5 minutes. So should I use it, or just eliminate this option?
Kinesis can handle large amounts of data, but is it the best way to do it?
I'm not familiar with Glue and Kinesis. But I read that Glue can be slow.
If anyone can point me in the right direction, it will be really helpful.
You can definitely make it work with Lambda if you leverage Step Functions and the S3 Select option to filter subsets of data into smaller chunks. You'd have Step Functions manage your ETL orchestration, wherein you execute Lambdas that selectively pull from the large data file via the S3 Select option. Your pre-process state (see links below) could be used to determine execution requirements, then execute multiple Lambdas, even in parallel if you wish. Those Lambdas would process the subsets of data to remove duplicates and perform any other ETL operations you might require. Then you'd take the processed data and write it to Redshift. Here are links that will help you put that architecture together:
Trigger State Machine Execution from S3 Event
Manage Lambda Processing Executions and workflow state
Use S3 Select to pull subsets from large data objects
Also, here's a link to a Python ETL pipeline example for the CDK that I built. You'll see an example of an S3 event-driven Lambda along with data processing and DDB or MySQL writes. It will give you an idea as to how you can build out comprehensive Lambdas for ETL operations. You would just need to add a psycopg2 layer to your deployment for Redshift. Hope this helps.
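To make the S3 Select piece concrete, here's a rough sketch of one of those Lambdas (bucket, key, and the SQL expression are hypothetical; the Redshift write is left as a stub):

```python
import boto3

s3 = boto3.client("s3")


def handler(event, context):
    # Pull only the needed rows/columns from a large CSV via S3 Select,
    # rather than downloading the whole object into the Lambda.
    response = s3.select_object_content(
        Bucket="my-ingest-bucket",             # hypothetical bucket
        Key="uploads/2020/09/14/records.csv",  # hypothetical key
        ExpressionType="SQL",
        Expression="SELECT s._1, s._2 FROM S3Object s WHERE s._3 = 'ACTIVE'",
        InputSerialization={"CSV": {"FileHeaderInfo": "NONE"}},
        OutputSerialization={"CSV": {}},
    )

    chunks = []
    for stream_event in response["Payload"]:
        if "Records" in stream_event:
            chunks.append(stream_event["Records"]["Payload"].decode("utf-8"))

    rows = "".join(chunks).splitlines()
    # ... de-duplicate here, then COPY/INSERT the rows into Redshift
    #     (e.g. via a psycopg2 layer) ...
    return {"rowsProcessed": len(rows)}
```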
I have a QuickSight dashboard pointed at an Athena table. Now I want to schedule a SPICE refresh every hour. As per the documentation, refreshing imports the data into SPICE again, so the data includes any changes since the last import.
If I have a 2 TB dataset in Athena and new data is added to Athena every hour, will QuickSight load 2 TB every hour to find the delta? If yes, it will increase the Athena cost. Does QuickSight query Athena to fetch the data?
As of the date of answering (11/11/2019) SPICE does in fact perform a full data set reload (i.e. no delta calculation or incremental refresh). I was able to verify this by using a MySQL data set and watching the query log while the refresh was occurring.
The implication for your question is that you would be charged every hour for Athena to query the 2TB data set.
If you do not need the robust querying that Athena provides, I would recommend pointing QuickSight to the S3 data directly.
My data is in parquet format. I guess QuickSight does not support a direct query on S3 parquet data.
Yes, we need to use Athena to read the parquet.
When you say point QuickSight to S3 directly, do you mean without SPICE?
Don't do it, it will increase the Athena and S3 costs significantly.
Solution:
Collect the delta from your source.
Push it into S3 (Unprocessed data)
Create a lambda function to pre-process the data (if needed)
Set up a trigger for lambda.
Process the data in Lambda, and convert it to parquet format with gzip compression (see the sketch after this list).
Push the data into S3 (Processed data)
Remove the unprocessed data from S3 or set up an S3 lifecycle to maintain it.
Also create a metadata table with primary_key and required fields.
S3 and Athena do not support updating records, so each time you push data it will be appended to the old data, and the entire data set will be scanned.
Both S3 and Athena follow a scan-first approach, so even if you apply a filter, the entire data is scanned before the filter is applied.
Use the metadata table to remove the old entry and insert the new entry.
Use partitions wherever possible to avoid scanning the entire data.
Once the data is available, configure Quicksight data refresh to pull the data into SPICE.
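Here is a minimal sketch of the pre-processing Lambda from the steps above (bucket names and prefixes are hypothetical; reading/writing s3:// paths with pandas assumes s3fs and pyarrow are packaged in a Lambda layer):

```python
import urllib.parse

import pandas as pd

PROCESSED_BUCKET = "my-processed-bucket"  # hypothetical bucket


def handler(event, context):
    # Triggered by an S3 put event on the unprocessed bucket.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["object"]["key"])

    df = pd.read_csv(f"s3://{bucket}/{key}")

    # ... cleanse / de-duplicate against the metadata table here ...

    out_key = key.rsplit(".", 1)[0] + ".parquet"
    # Write parquet with gzip compression to the processed bucket.
    df.to_parquet(f"s3://{PROCESSED_BUCKET}/{out_key}", compression="gzip", index=False)
    return {"written": out_key}
```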
Best practice:
Always go with SPICE (Direct queries are expensive and have high latency)
Use the incremental refresh wherever possible.
Always use static data, do not process the data for each dashboard visit/refresh.
Increase your Quicksight SPICE data refresh frequency
I have a use case wherein I want to take data from DynamoDB and do some transformations on it. After this I want to create 3 CSV files (there will be 3 transformations on the same data) and dump them to 3 different S3 locations.
My architecture would look something like the following:
Is it possible to do so? I can't seem to find any documentation regarding it. If it's not possible using Data Pipeline, are there any other services which could help me with my use case?
These dumps will be scheduled daily. My other consideration was using AWS Lambda. But to my understanding, it's triggered by events rather than scheduled by time. Is that correct?
Yes, it is possible, but using EMRActivity instead of HiveActivity. If you look into the Data Pipeline documentation for HiveActivity, it clearly states its purpose, which does not suit your use case:
Runs a Hive query on an EMR cluster. HiveActivity makes it easier to set up an Amazon EMR activity and automatically creates Hive tables based on input data coming in from either Amazon S3 or Amazon RDS. All you need to specify is the HiveQL to run on the source data. AWS Data Pipeline automatically creates Hive tables with ${input1}, ${input2}, and so on, based on the input fields in the HiveActivity object.
Below is how your data pipeline should look. There is also an inbuilt template, Export DynamoDB table to S3, in the AWS Data Pipeline UI which creates the basic structure for you; you can then extend/customize it to suit your requirements.
To your next question about using Lambda: of course, Lambda can be configured with event-based or schedule-based triggering, but I wouldn't recommend using AWS Lambda for any ETL operations, as they are time-bound and typical ETLs run longer than Lambda's time limits.
AWS has specific optimized offerings for ETL, namely AWS Data Pipeline and AWS Glue; I would always recommend choosing between those two. In case your ETL involves data sources not managed within AWS compute and storage services, or any specialty use case which can't be satisfied by the above two options, then AWS Batch would be my next consideration.
Thanks amith for your answer. I have been busy for quite some time now, but I did some digging after you posted your answer. It turns out we can dump the data to different S3 locations using Hive activity as well.
This is how the data pipeline would look in that case.
But I believe writing multiple Hive activities, when your input source is a DynamoDB table, is not a good idea, since Hive doesn't load any data into memory. It does all the computations on the actual table, which could deteriorate the table's performance. Even the documentation suggests exporting the data in case you need to run multiple queries against the same data. Reference
Enter a Hive command that maps a table in the Hive application to the data in DynamoDB. This table acts as a reference to the data stored in Amazon DynamoDB; the data is not stored locally in Hive and any queries using this table run against the live data in DynamoDB, consuming the table’s read or write capacity every time a command is run. If you expect to run multiple Hive commands against the same dataset, consider exporting it first.
In my case I needed to perform different types of aggregations on the same data once a day. Since DynamoDB doesn't support aggregations, I turned to Data Pipeline using Hive. In the end we ended up using AWS Aurora, which is MySQL-based.