Using S3 as a target for AWS DMS: uploaded file name doesn't change

We are using AWS DMS to extract data from SQL Server and load it into an S3 bucket; from there, the data is loaded into Snowflake using Snowpipe (full load).
For Snowpipe to know there is new data in the S3 bucket, the file name needs to differ from the previous one. I have tried all of the available task setting options (DROP_AND_CREATE, DO_NOTHING, TRUNCATE) to get a different file name, but it still doesn't work: every run writes the file as LOAD00000001.csv.
The documentation says the file names will be incremental (e.g. LOAD00000001.csv, LOAD00000002.csv, and so on), but that isn't happening, which is why Snowpipe does not register the changes.
https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Target.S3.html
Can someone please help?

For DMS, the incremental counter restarts from 1 each time the task runs. There is no "don't overwrite existing objects" feature.
Your best bet may be to handle the load yourself by looking for updated object timestamps in your folder or setting up S3 event notifications.
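One way to "handle the load yourself" is sketched below, with assumed bucket names and prefixes: an S3-triggered Lambda that copies each DMS full-load file to a timestamped key, so the downstream consumer (Snowpipe) always sees a new file name.

# Hypothetical workaround sketch: copy each DMS full-load file to a
# timestamped key so downstream consumers (e.g. Snowpipe) see a new name.
# Bucket/prefix names are placeholders. Scope the bucket's
# s3:ObjectCreated:* notification to the DMS output prefix so the copies
# made here do not re-trigger the function.
import urllib.parse
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
TARGET_PREFIX = "snowpipe-stage/"  # assumed prefix watched by Snowpipe


def lambda_handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Skip anything that is not a DMS full-load output file.
        if not key.split("/")[-1].startswith("LOAD"):
            continue

        # Append a UTC timestamp so each run produces a unique object name.
        stamp = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
        new_key = f"{TARGET_PREFIX}{key.rsplit('.', 1)[0]}_{stamp}.csv"

        s3.copy_object(
            Bucket=bucket,
            Key=new_key,
            CopySource={"Bucket": bucket, "Key": key},
        )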

Related

How can I achieve date-based folder partitioning when I am running full-loads?

I am currently trying to run a daily snapshot (I schedule it to run every day) of an RDS (Postgres) database using AWS DMS. My destination endpoint is S3 and I am using Full Load Replication. My goal is to write every snapshot into a separate date partition. For example, I would like to write today's snapshot into an S3 folder partition that looks like:
database_schema_name/table_name/2021/11/06/13/LOAD00000001.csv.
However, it seems like I am unable to achieve the folder partitioning with full loads. Reading the docs on date-based folder partitioning (here):
You can enable date-based folder partitioning when you create an S3 target endpoint. You can enable it when you either migrate existing data and replicate ongoing changes (full load + CDC), or replicate data changes only (CDC only).
As I understand the docs, date-based folder partitioning is only available for CDC-only or full load + CDC tasks. I have also tried using it myself, without success.
To summarize, my goal is to run full loads via DMS and place each day's load into a date-partitioned folder structure within S3. I am aware that I can write to S3 and then move the written files into folders using Lambdas, but I was hoping to achieve this cleanly with DMS, without adding further complexity to the system.
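For reference, the workaround hinted at above could be as simple as a small script run after each daily full load that copies the LOAD*.csv files into a date-partitioned prefix; the bucket name and prefixes below are placeholders:

# Sketch: after a scheduled full load finishes, copy the LOAD*.csv files
# into a year/month/day/hour prefix. Bucket and prefixes are placeholders.
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
BUCKET = "my-dms-bucket"
SOURCE_PREFIX = "database_schema_name/table_name/"

partition = datetime.now(timezone.utc).strftime("%Y/%m/%d/%H")

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=SOURCE_PREFIX):
    for obj in page.get("Contents", []):
        relative = obj["Key"][len(SOURCE_PREFIX):]
        if "/" in relative:
            continue  # already partitioned by a previous run
        if not relative.startswith("LOAD"):
            continue  # not full-load output

        new_key = f"{SOURCE_PREFIX}{partition}/{relative}"
        s3.copy_object(Bucket=BUCKET, Key=new_key,
                       CopySource={"Bucket": BUCKET, "Key": obj["Key"]})
        # Optionally remove the original so the next run starts clean:
        # s3.delete_object(Bucket=BUCKET, Key=obj["Key"])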

Putting a TWS file dependency on an AWS S3 stored file

I have an ETL application that is supposed to be migrated to AWS infrastructure. The scheduler used in my application is Tivoli Workload Scheduler (TWS), and we want to keep using it in the cloud as well; it has file dependencies.
Now, when we move to AWS, the files to be watched will land in an S3 bucket. Can we put an OPEN dependency on files in S3? If yes, what would the hostname be (HOST#Filepath)?
If not, which services should be used to serve this purpose? I have both time and file dependencies in my SCHEDULES.
E.g. a file might get uploaded to S3 at 1 AM. At 3 AM my schedule gets triggered and looks for the file in the S3 bucket; if it is present, execution starts, and if not, it should wait according to the other parameters in TWS.
Any help or advice would be nice to have.
If I understand this correctly, the job triggered at 3 AM should identify all files uploaded within, for example, the last 24 hours.
You can list all S3 objects to find everything uploaded within a specific period of time.
A better solution would be to create an S3 upload trigger that sends a notification to SQS, then have your code inspect the queue depth (number of messages) and process the files one by one. An additional benefit is the assurance that all items are processed, without having to worry about time overlaps.
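A minimal boto3 sketch of that SQS approach (the queue URL is a placeholder, and the bucket is assumed to publish s3:ObjectCreated notifications to it):

# Sketch: drain the queue of S3 ObjectCreated notifications and process
# each referenced file. Queue URL and bucket wiring are assumptions.
import json

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/file-arrivals"

while True:
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10,
                               WaitTimeSeconds=5)
    messages = resp.get("Messages", [])
    if not messages:
        break  # queue drained: everything uploaded so far has been handled

    for msg in messages:
        body = json.loads(msg["Body"])
        for record in body.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            # ... download/process the file here ...
            print(f"processing s3://{bucket}/{key}")

        # Remove the message once the file has been handled.
        sqs.delete_message(QueueUrl=QUEUE_URL,
                           ReceiptHandle=msg["ReceiptHandle"])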

Using Amazon S3 for customer images and thumbnails

I see on the Lambda support pages there are examples of scripts to create thumbnail images in a separate bucket any time an image is uploaded. But I'm looking at using S3 to upload customer image files for multiple customers. We will likely use something like dropzone.js for handling the uploads and I've already built a working example to upload to an existing bucket.
But since we will be dealing with multiple customers, I'm wondering what the best practices are for handling different customers' files with S3, especially given the need to display thumbnails to the customer.
I note the Lambda solution appears to use a pre-configured bucket including all of the necessary permissions and event triggers to run the script. I'm not very familiar with Node.js, have done very little in Java or Python, and I'm new to the AWS environment.
Should I create a new bucket for each customer? Can I? Do I have to add new lambda createThumbnail permissions/event-triggers every time a new bucket is created for a new customer?
Is there a better way to do this?
I would also be curious to know (being new to node.js and aws) how difficult it would be to build a cached thumbnail only when it was requested as opposed to trying to build one whenever a file is uploaded.
You can use the same bucket, with a sub-folder for each customer/user containing their thumbnail images (you can name each folder with ${user_id} or something similar).
The workflow could be:
The full image is uploaded to S3 into the customer's sub-folder from your UI (dropzone.js or whatever).
Upon successful upload, use the S3 object-creation event to trigger your Lambda to process the image and generate a thumbnail (putting it in a Thumbnails sub-folder is an option).
Ex:
YOUR_NEW_BUCKET
|
+---- customer_1
      |
      +---- Image1.png
      +---- Image2.jpg
      +---- Thumbnails
            |
            +---- Image1.png
            +---- Image2.jpg
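For reference, a rough sketch of what such a Lambda could look like, assuming Pillow is packaged with the function (e.g. as a Lambda layer) and following the folder layout above; everything else is a placeholder:

# Sketch: S3-triggered Lambda that writes a thumbnail next to the original,
# under the customer's Thumbnails/ sub-folder.
import io
import urllib.parse

import boto3
from PIL import Image  # assumes Pillow is available (e.g. via a layer)

s3 = boto3.client("s3")
THUMB_SIZE = (200, 200)


def lambda_handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Avoid re-processing thumbnails we created ourselves.
        if "/Thumbnails/" in key:
            continue

        customer_folder, file_name = key.rsplit("/", 1)

        obj = s3.get_object(Bucket=bucket, Key=key)
        image = Image.open(io.BytesIO(obj["Body"].read()))
        image.thumbnail(THUMB_SIZE)  # resize in place, keeping aspect ratio

        buffer = io.BytesIO()
        image.save(buffer, format=image.format or "PNG")
        buffer.seek(0)

        s3.put_object(
            Bucket=bucket,
            Key=f"{customer_folder}/Thumbnails/{file_name}",
            Body=buffer,
        )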

Is there any way to set up an S3 bucket to append to the existing object on each run?

We have a requirement to append to an existing S3 object when we run the Spark application every hour. I have tried this code:
df.coalesce(1).write.partitionBy("name").mode("append").option("compression", "gzip").parquet("s3n://path")
This application creates new Parquet files on every run, so I am looking for a workaround to achieve this requirement.
The question is:
How can we configure the S3 bucket so that data gets appended to the existing object?
It is not possible to append to objects in Amazon S3. They can be overwritten, but not appended.
There is apparently a sneaky method using a multipart upload, where the existing object is 'copied' in as the first part and the additional data is uploaded as a further part. However, that cannot be accomplished with the method you show.
If you wish to add additional data to an External Table (e.g. one used by EMR or Athena), then simply add an additional file in the correct folder for the desired partition.
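For completeness, here is a rough sketch of the multipart-copy trick described above (bucket and key names are placeholders; note that every part except the last must be at least 5 MB, so this only works once the existing object has reached that size):

# Sketch of the multipart "append": copy the existing object in as part 1,
# upload the new data as part 2, then complete the upload over the same key.
import boto3

s3 = boto3.client("s3")
bucket = "my-bucket"
key = "data/existing-object"
new_data = b"rows to append..."

mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
upload_id = mpu["UploadId"]

# Part 1: the current contents of the object (must be >= 5 MB).
part1 = s3.upload_part_copy(
    Bucket=bucket, Key=key, UploadId=upload_id, PartNumber=1,
    CopySource={"Bucket": bucket, "Key": key},
)
# Part 2: the new data to "append".
part2 = s3.upload_part(
    Bucket=bucket, Key=key, UploadId=upload_id, PartNumber=2, Body=new_data,
)

s3.complete_multipart_upload(
    Bucket=bucket, Key=key, UploadId=upload_id,
    MultipartUpload={"Parts": [
        {"PartNumber": 1, "ETag": part1["CopyPartResult"]["ETag"]},
        {"PartNumber": 2, "ETag": part2["ETag"]},
    ]},
)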

Amazon S3 and CloudFront with TTL=0: testing procedure

I would like to test and verify that my TTL=0 setting works.
What I have:
An S3 bucket mounted to a directory on my Red Hat machine, so when I edit a simple txt file from the shell, I can open it in the AWS console bucket manager and view the file. I have also created a CloudFront distribution, so I can open the txt file via the CloudFront link.
Test:
I edit the txt file over my telnet session, then open it in the AWS console in the S3 bucket section and see that the file has changed; but when I open the file via the CloudFront link, it hasn't changed. This means the TTL=0 did not work.
How can I verify that TTL=0 works and that it is set correctly? After creating the distribution, I cannot find where to edit the TTL again.
Thanks
Quoting AWS:
Note that our default behavior isn’t changing; if no cache control header is set, each edge location will continue to use an expiration period of 24 hours before checking the origin for changes to that file. You can also continue to use Amazon CloudFront's Invalidation feature to expire a file sooner than the TTL set on that file.
You're likely not setting the Cache-Control header correctly. One way to confirm that is to enable S3 bucket logging: new log files will appear whenever there are new HTTP GETs against your S3 bucket, even if they come from CloudFront.
You could also test S3 directly with curl (or s3curl) so you can inspect the headers it returns.
My recommendation is that, whenever you upload new content, you force CloudFront to invalidate it. If you're using tools like s3fs, then inotify/incron might help you.
(Disclaimer: I totally hate the whole idea of mapping filesystems off to S3. They're quite different tools and you're likely to get 'leaky abstractions')
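If you want to script the invalidation mentioned above, a minimal boto3 sketch (the distribution ID and path are placeholders) could look like this:

# Sketch: invalidate a changed object in CloudFront after uploading it.
import time

import boto3

cloudfront = boto3.client("cloudfront")

cloudfront.create_invalidation(
    DistributionId="E1234567890ABC",
    InvalidationBatch={
        "Paths": {"Quantity": 1, "Items": ["/path/to/file.txt"]},
        # CallerReference must be unique per invalidation request.
        "CallerReference": str(time.time()),
    },
)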
It is most likely that you are not sending any caching (TTL) headers from S3. CloudFront looks for a caching header on the source object and, if it doesn't find one, defaults to 24 hours.
You could look at setting a bucket policy or using a tool like S3 Browser to automatically set the headers: http://s3browser.com/automatically-apply-http-headers.php
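If you prefer to set the header at upload time from code, a minimal boto3 sketch (bucket, key, and body are placeholders) could look like this:

# Sketch: upload an object with an explicit Cache-Control header so
# CloudFront does not fall back to the 24-hour default.
import boto3

s3 = boto3.client("s3")

s3.put_object(
    Bucket="my-bucket",
    Key="index.txt",
    Body=b"hello",
    ContentType="text/plain",
    CacheControl="max-age=0, no-cache",  # effectively TTL=0 at the edge
)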
If you just want to test then I would follow the steps below.
1. Create a new text file in your bucket.
2. Through the AWS console, locate the file and check and/or add the caching headers.
3. Retrieve the file from CloudFront.
4. Change the file in the bucket.
5. Check the headers of the new file in the AWS console (your S3 mapping utility may erase the previous file headers).
6. Retrieve the newly changed file from CloudFront.
Sending an invalidation call to CloudFront for every change may become chargeable if you have a large number of edits per month. Also, invalidations take several minutes (sometimes 20 minutes or more) to propagate, meaning you can never instantly change your content.