Working on a project where we need to have an incremental load on daily basis, We are using Glue for the ETL purpose. We are getting duplicates or data getting doubled using Glue.
pipeline flow: Ingestion Zone, Raw Zone, Curated zone, consumption zone.
History: 1000 records. Below dates on updates and inserts
End of the Jan-11 run, I would like to see my total records of 1100 records as I'm upserting the data in rawtocurated zone. However, I'm getting the doubled-up records in the curated zone. The data is partitioned on a run date basis. like 2020/01/10/data.csv and 2020/01/11/data.csv
What changes should I make to avoid only the delta records (or) incremental records to be seen in the comsumption zone?
As per my understanding of problem statement : Glue job bookmarks feature is used along with the meta data catalog tables to ensure only new data is processed .
Few Queries :
Is your curated zone build on top of s3 or any other RDS services provided ?
Is is direct updates or SCD-2 data transformation ?
Have you by any chance reset/paused/disable the Job bookmarks ?
If you say data is partitioned on run-date basis so partitioning is applicable on ingestion layer [multiple date specific folders under on S3 bucket and data maintained in parquet format ] or on target curated layer ?
Even if that is not solving your problem that I would recommend you to write custom spark code using either pyspark/scala encapsulating your processing logics
Related
My Amazon S3 path is as follows:
s3://dev-mx-allocation-storage/ph_test_late_waiver/{year}/{month}/{day}/{flow_number}*.csv
I need to create a pipeline from S3 to Snowflake where for each day of the month a new csv file would fall into the bucket and that csv file should be inserted into a snowflake table.
I am very new to this, can I please get a command in snowflake which can do that?
Snowpipe lends itself well to real-time requirements of data, as it loads data based on triggers and can manage vast and continuous loading. Data volumes and the compute/storage resources to load data are managed by the Snowflake cloud, which is why it is promoted as a serverless feature. If it’s one less thing to manage, all the better to focus our energies on our own application development!
Step by step guide: https://medium.com/#walton.cho/auto-ingest-snowpipe-on-s3-85a798725a69
I am looking for advice which service should I use. I am new to big data and confused with differences between them on AWS.
Use case:
I receive 60-100 csv files daily (each one can be from few MB to few GB). There are six corresponding schemas, and each file can be treated as part of only one table.
I need to load those files to the six database tables and execute joins between them and generate daily output. After generation of the output, the data present in database is no longer need, so we can truncate that tables and await on the next day.
Files have predictable naming patterns:
A_<timestamp1>.csv goes to A table
A_<timestamp2>.csv goes to A table
B_<timestamp1>.csv goes to B table
etc ...
Which service could be used for that purpose?
AWS Redshift (execute here joins)
AWS Glue (load to redshift)
AWS EMR (spark)
or maybe something else? I heard that spark could be used to do the joins, but what is the proper, optimal and performant way of doing that?
Edit:
Thanks for the responses. I see two options for now:
Use AWS Glue, setup 6 crawlers which will load on trigger files to specific AWS Glue Data Catalogs, execute SQL joins with Athena
Use AWS Glue, setup 6 crawlers which will load on trigger files to specific AWS Glue Data Catalogs, trigger spark job (AWS Glue in serverless form) to do the SQL joins and setup output to the S3.
Edit 2:
But according to the: https://carbonrmp.com/knowledge-hub/tech-engineering/athena-vs-spark-lessons-from-implementing-a-fully-managed-query-system/
Presto is designed for low latency and uses a massively parallel processing (MPP) approach which is fast but requires everything to happen at once and in memory. It’s all or nothing, if you run out of memory, then “Query exhausted resources at this scale factor”. Spark is designed for scalability and follows a map-reduce design [1]. The job is split and processed in chunks, which are generally processed in batches. If you double the workload without changing the resource, it should take twice as long instead of failing [2]
So Athena (aka Presto) is not scalable as much as I want. I've seen "Query exhausted resources at this scale factor" for my case.
Any possibility of changing the file type to a columnar format like parquet? Then you can use AWS EMR and spark should be able to handle the joins easily. Obviously, you need to optimize the query depending on the data/cluster size etc.
I am using BigQuery Data Transfer Service to migrate all data from redshift to bigquery.
After that, i want to perform backfilling for specific time, if any data is missing. But i don't see any backfilling option in Transfer job.
How can i achieve that in bigquery?
Reading your question under the light of your comments I would proceed differently from what you describe. You reach the same goal however :) .
Using your ETL pipeline, the first step would be to accumulate raw data in a datalake.
Let's take a storage service like S3 to do so. For this ETL pipeline, S3 is your datasink.
Note that your pipeline does nothing more than taking raw data from A to put it into S3. Also, the location in S3 should be under a timestampted folder on day for instance (e.g: yyyymmdd) so that you can sort and consume your data on time dimension.
Obviously the considered data is ahead in time from the one you already have in Redshift.
Maybe it is also a different structure from the one you already put in redshift due to potential transformation you set in your initial pipeline.
In case you set raw data directly into redshift, then just export the data into the same S3 bucket under the name legacy/*. (In case it is transformed, then you have to put a second S3 datasink in your pipeline with this intermediary transformation an keep the same S3 naming strategy).
Let's take a break to understand what we have. We filled an S3 bucket with raw data that we can now replay at will on a specific day using a cron or an orchestrating tool such as Apache Airflow. Moreover you can freely modified the content of each timestamped folder in case you missed data to replay the following pipelines => the backfill you want.
Speaking of which, S3 would act as a data source for these following pipelines that would set wanted transformations on the raw data from S3 and choose BigQuery and potentially Redshift as Datasink. Now please take in consideration the price of these operations. Streaming API in BQ is expensive. As high of 0.50$ per Gb. Do that only if you need real time result. If you can afford latency of more than 5 minutes a better strategy would to set GCS as the datasink of your ETL and transfer the data from there into BQ (note to put the data in the same file naming pattern yyyymmdd to enable potential backfill). This transfer is free if GCS bucket and BQ dataset are in the same region. You would trigger the transfer with GCS events for instance (trigering a cloud function on blob creation that put the data into BQ).
Last but not least, backfilling should be done wisely especially in BQ where update or creation at row level is not peformant and is an open door for duplication. What you should consider is BigQuery partition that you can set on a column that contains a timestamp or an hidden one if your data contains none. Which timestamp? Well the one set in GCS folder name!
Once again you can modify data in your GCS bucket per day and replay the transfer into BQ.
But each transfer from a given day must overwrtite the partition the considered data belongs to. (e.g: the data under 20200914 would overwrite the associated partition in BQ. We abide by the concept of pure task doing so which a guarantee for idempotency and non duplication).
Please read this article to have more insights.
Note: If you intend to get rid off Redshit, you can choose to do it directly and forget about S3 as a datasink of your first ETL. Choose directly GCS (ingress is free) and migrate your already present Redshift data into GCS using S3 as an intermediary service and the Google transfer service from S3 to GCS.
I have a Quick Sight dashboard pointed to Athena table. Now I want to schedule to refresh SPICE every hour. As per documentation, Refreshing imports the data into SPICE again, so the data includes any changes since the last import.
If I have a 2TB dataset in Athena and every hour new data added in Athena. So QuickSight will load 2TB every hour to find the delta? if yes, it will increase the Athena cost. Does QuickSight query on Athena to fetch data?
As of the date of answering (11/11/2019) SPICE does in fact perform a full data set reload (i.e. no delta calculation or incremental refresh). I was able to verify this by using a MySQL data set and watching the query log while the refresh was occurring.
The implication for your question is that you would be charged every hour for Athena to query the 2TB data set.
If you do not need the robust querying that Athena provides, I would recommend pointing QuickSight to the S3 data directly.
My data is in parquet format. I guess Quicksight does not support a direct query on s3 parquet data.
Yes, we need to use Athena to read the parquet.
When you say point QuickSight to S3 directly, do you mean without SPICE?
Don't do it, it will increase the Athena and S3 costs significantly.
Sollution:
Collect the delta from your source.
Push it into S3 (Unprocessed data)
Create a lambda function to pre-process the data (if needed)
Set up a trigger for lambda.
Process the data in lambda, and convert the data to parquet format with gzip compression.
Push the data into S3 (Processed data)
Remove the unprocessed data from S3 or set up an S3 lifecycle to maintain it.
Also create a metadata table with primary_key and required fields.
S3 & Athena do not support update records, so each time you push the data it will be appended to the old data, and the entire data will be scanned.
Both S3 and Athena follow the scan-first approach, so even though you are applying a filter it will scan the entire data before it applies the filter.
Use the metadata table to remove the old entry and insert the new entry.
Use partitions wherever possible to avoid scanning the entire data.
Once the data is available, configure Quicksight data refresh to pull the data into SPICE.
Best practice:
Always go with SPICE (Direct queries are expensive and have high latency)
Use the incremental refresh wherever possible.
Always use static data, do not process the data for each dashboard visit/refresh.
Increase your Quicksight SPICE data refresh frequency
When new partition is added to an Athena table, we could use either Glue Crawler or MSCK REPAIR TABLE to update meta info. What are the cost for them? Which one is preferred?
MSCK REPAIR TABLE command requires your S3 key to include the partition scheme as documented here. If your S3 key does not include the partition scheme, the MSCK REPAIR TABLE command will return missing partitions, but you will still have to add them in. Also one other difference is that the MSCK REPAIR TABLE command can time out after 30 minutes (default Athena query time length) while glue crawler will not.
Here is pricing information:
Glue Crawler:
There is an hourly rate for AWS Glue crawler runtime to discover data and populate the AWS Glue Data Catalog. You are charged an hourly rate based on the number of Data Processing Units (or DPUs) used to run your crawler. A single Data Processing Unit (DPU) provides 4 vCPU and 16 GB of memory. You are billed in increments of 1 second, rounded up to the nearest second, with a 10-minute minimum duration for each crawl. Use of AWS Glue crawlers is optional, and you can populate the AWS Glue Data Catalog directly through the API.
Pricing
For all AWS Regions where AWS Glue is available:
$0.44 per DPU-Hour, billed per second, with a 10-minute minimum per crawler run
Athena:
There are no charges for Data Definition Language (DDL) statements like CREATE/ALTER/DROP TABLE, statements for managing partitions, or failed queries.
However, on top of both of these commands you will still incur S3 costs. Reference: AWS Athena: does `msck repair table` incur costs?
My opinion is it is best to manage the partition yourself if you are able to, after adding new data.
'ALTER TABLE database.table ADD
PARTITION (partition_name='PartitionValue') location 's3://bucket/path/partition'
If forced to use Glue or Athena, I would evaluate which way will fit better into your process. The MSCK REPAIR TABLE command may be easier to manage but you may run into trouble if you have a lot of data in partitions or they are not partitioned correctly. Also, you will have to have a way to automate running the command. Glue Crawlers can be configured with triggers.
I agree with adding partitions manually. You can do this via an Athena query (ALTER TABLE ... ADD PARTITION () ...) as in the answer from #KiteCoder, or you can do this via the Glue API directly.
Calling the Glue API is more verbose, but also more 'structured'. Calling Athena is obviously a SQL query, and I know how many people despise writing code that dynamically generates SQL.
The specific operation is CreatePartition. It does require an object called StorageDescriptor which defines all the columns and data types in that table, but for an existing table you can retrieve that structure from the GetTable operation.