I have a transactional table that I load from SQL server to Amazon S3 S3 using AWS DMS. For handling updates I move the old files to archive and then process only the incremental records everytime.
This is fine when I have only insert operation in my database. But problem comes when we need to accomodate the updates. Right now for doing any updates we read the entire S3 file and make changes to the records which are updated as a part of incremental load. Then as the data keeps on increasing the process of reading the entire file in S3 bucket and updating it would take more time and in future the job might not be able to end in time (Considering that job needs to end in 1 hour).
This can be handled using Databricks where we can use Delta Table to update the records and finally overwrite the existing file. But Databricks is bit expensive.
How do we handle the same using AWS Glue?
Related
I have a bunch of tables in DynamoDB (80 right now but can grow in future) and looking to sync data from these tables to either Redshift or S3 (using Glue on top of it to query using Athena) for running analytics query.
There are also very frequent updates (to existing entries) and deletes happen in the DynamoDB tables which I want to sync along with addition of newer entries.
Checked the Write capacity units (WCU) consumed for all tables and the rate is coming out to be around 35-40 WCU per second at peak times.
Solutions I considered:
Use Kinesis Firehose along with lambda (which reads updates from DDB streams) to push data to Redshift in small batches (Issue: it cannot support updates and deletes and is only good for adding new entries because it uses Redshift COPY command under the hood to upload data to Redshift)
Use Lambda (which reads updates from DDB streams) to copy data to S3 directly as json with each entry being a separate file. This can support updates and deletes if we have S3 filepath same as primary key of DynamoDB tables (Issue: It will result in tons of small files in S3 which might not scale for querying using AWS Glue)
Use Lambda (which reads updates from DDB streams) to update data to Redshift directly as soon as a new update happens. (Issue: Too many small writes to Redshift can cause scaling issues for Redshift as it is more suited for batch writes/updates)
My Amazon S3 path is as follows:
s3://dev-mx-allocation-storage/ph_test_late_waiver/{year}/{month}/{day}/{flow_number}*.csv
I need to create a pipeline from S3 to Snowflake where for each day of the month a new csv file would fall into the bucket and that csv file should be inserted into a snowflake table.
I am very new to this, can I please get a command in snowflake which can do that?
Snowpipe lends itself well to real-time requirements of data, as it loads data based on triggers and can manage vast and continuous loading. Data volumes and the compute/storage resources to load data are managed by the Snowflake cloud, which is why it is promoted as a serverless feature. If it’s one less thing to manage, all the better to focus our energies on our own application development!
Step by step guide: https://medium.com/#walton.cho/auto-ingest-snowpipe-on-s3-85a798725a69
I have a csv file consisting of 2M records which would be uploaded to AWS S3 once or twice every day.I need to dump this file in our database which can at time handle approximately ~1K records OR ~40-50k/min using batch upload.
I was planning to use AWS lambda but since it has timeout of 15min I would only be able to insert ~0.7M records.I also read that we can invoke another lambda function with new offset but I am looking to process this file at a stretch.
What should be my ideal approach for such scenarios.Should I spin up an EC2 instance for handling batch uploads ?
Any help would be appreciated
Consider using Database Migration Service.
You can migrate data from an Amazon S3 bucket to a database using AWS DMS. The source data files must be in comma-separated value (.csv) format.
Why don't you have one lambda running through the file and inserting the records into SQS?
Pretty sure this takes less than 15 minutes. A second Lambda consumes the records from SQS and inserts them into the database. This way you don't risk overloading your database since the lambda won't retrieve more than 10 records from the queue.
Of course this is one solution of many.
So I have some historical data on S3 in .csv/.parquet format. Everyday I have my batch job running which gives me 2 files having the list of data that needs to be deleted from the historical snapshot, and the new records that needs to be inserted to the historical snapshot. I cannot run insert/delete queries on athena. What are the options (cost effective and managed by aws) do I have to execute my problem?
Objects in Amazon S3 are immutable. This means that be replaced, but they cannot be edited.
Amazon Athena, Amazon Redshift Spectrum and Hive/Hadoop can query data stored in Amazon S3. They typically look in a supplied path and load all files under that path, including sub-directories.
To add data to such data stores, simply upload an additional object in the given path.
To delete all data in one object, delete the object.
However, if you wish to delete data within an object, then you will need to replace the object with a new object that has those rows removed. This must be done outside of S3. Amazon S3 cannot edit the contents of an object.
See: AWS Glue adds new transforms (Purge, Transition and Merge) for Apache Spark applications to work with datasets in Amazon S3
Data Bricks has a product called Delta Lake that can add an additional layer between queries tools and Amazon S3:
Delta Lake is an open source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark APIs.
Delta Lake supports deleting data from a table because it sits "in front of" Amazon S3.
We have a hundreds of thousands of .csv files stored in S3 that contain at least several data records each. (each record is its own row)
I am trying to design a migration strategy to transform all the records in the .csv files and put them into DynamoDB. During the migration, I'd also like to ensure that if any new .csv gets added to the S3 bucket, we automatically trigger a lambda or something to do the transformation and write to DynamoDB as well.
Eventually we'd stop writing to S3 entirely, but initially we need to keep those writes and any writes to S3 to also trigger a write to DynamoDB. Does anyone know of any good strategies for doing this? (Is there something like DynamoDB streams except for S3?) Any strategies for getting the existing things in .csv in S3 over to DynamoDB in general?
AWS has many tools you can use to solve this problem. Here are a few.
You could use AWS Database Migration Service. It supports migrating data from S3 and into DynamoDB. This AWS product is designed specifically for your use case, and it handles pretty much everything.
Once the migration has started, DMS manages all the complexities of the migration process including automatically replicating data changes that occur in the source database during the migration process.
S3 can publish events to trigger a lambda function which can be used to continuously replicate the data to DynamoDB.
AWS Data Pipelines basically does batch ETL jobs, which could move your data all at once from S3 to DynamoDB. You might also be able to run periodic sync jobs if you can tolerate a delay in replicating data to DynamoDB.
AWS Glue can crawl your data, process it, and store it in another location. I think it would provide you with an initial load plus the ongoing replication. While it could work, it’s designed more for unstructured data, and you have CSV files which are usually structured.
I’d recommend using AWS Database Migration Service because it’s the one-stop solution, but if you can’t use it for some reason, there are other options.
I don't know if DynamoDB has "load records from CSV" feature (RedShift does).
If it does not, then you could roll your own. Write a Python function that imports the csv and boto3 modules, takes as input an S3 path (inside an event dictionary). The function would them download the file from S3 to temp dir, parse it with csv, then use boto3 to insert into DynamoDB.
To get the history loaded, write a function that uses `boto3' to read the list of objects in S3, then call the first function to upload to DynamoDB.
To get future files loaded, install the first function as a Lambda function, and add a trigger from S3 Object Creation events to run the function whenever a new object is put onto S3.