Deleted S3 objects reappearing when "folder" is recreated - amazon-web-services

I am developing a user flow for a project in which a user can replace a file. In practice, this means we rapidly delete and replace S3 objects. I am running into a bug when performing the following operations on a bucket.
Step 1: Delete all S3 objects at a certain path, my_bucket/my_folder/*. The program uses the S3.bucket.objects.delete() method to perform this operation. When this step finishes, my_bucket/my_folder/ is no longer visible in the S3 console.
Step 2: Create a new set of S3 objects at the same path, my_bucket/my_folder/*. We use the put_object() operation here (both steps are sketched below).
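For concreteness, the two steps look roughly like this minimal boto3 sketch (the bucket, prefix, and key names are placeholders, not the real ones):

```python
import boto3

s3 = boto3.resource("s3")
bucket = s3.Bucket("my_bucket")  # placeholder bucket name

# Step 1: batch-delete every object under the prefix.
bucket.objects.filter(Prefix="my_folder/").delete()

# Step 2: immediately recreate objects under the same prefix.
bucket.put_object(Key="my_folder/file1.txt", Body=b"new contents")
```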
However, I am noticing something weird when step 2 is performed. On an inconsistent basis (more often when steps 1 and 2 are done in rapid succession), the creation of new objects at my_bucket/my_folder/* in step 2 triggers the re-population of that “folder” with all of the old files that were deleted in step 1.
This does not appear to happen when waiting > 10 minutes between executing step 1 and step 2, which makes me believe it is a caching issue.
Is it possible to get around this behavior somehow?

Related

Triggering Lambda on basis of multiple files

I'm a bit confused. I need to run an AWS Glue job when multiple specific files are available in S3. On every file put event in S3, I trigger a Lambda which writes that file's metadata to DynamoDB. In DynamoDB, I also maintain a counter which counts the number of required files present.
But when multiple files are uploaded at once, multiple Lambdas are triggered and they write to DynamoDB at nearly the same time, which throws off the counter; hence the counter does not count accurately.
I need a better way to start a job when specific (multiple) files are made available in S3.
Kindly suggest a better way.
DynamoDB is eventually consistent by default. You need to request a strongly consistent read to guarantee you are reading the same data that was just written.
See this page for more information, or for a more concrete example, see the ConsistentRead flag in the GetItem docs.
It's worth noting that this will only minimise the problem. There is still a very small window between reads and writes where network lag can cause one function to read or write while another is doing the same. You should think about allowing only one function to run at a time, or adding some other logic to guarantee mutually exclusive access to the table.
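For example, a strongly consistent GetItem with boto3 might look like this (the table and key names are made up for illustration):

```python
import boto3

dynamodb = boto3.client("dynamodb")

# ConsistentRead=True guarantees the item reflects all writes that
# completed before this read (at the cost of higher read capacity).
response = dynamodb.get_item(
    TableName="file_tracker",           # hypothetical table name
    Key={"job_id": {"S": "job-123"}},   # hypothetical partition key
    ConsistentRead=True,
)
item = response.get("Item")
```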
It sounds like you are getting the current count, incrementing it in your Lambda function, and then updating DynamoDB with the new value. Instead, you should use DynamoDB atomic counters, which ensure that multiple concurrent updates do not cause the problems you are describing.
With atomic counters, you simply send DynamoDB a request to increment your counter by 1. If your Lambda needs to check whether this was the last file you were waiting on before doing other work, you can use the return value from the update call to see what the new count is.
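A minimal sketch of what that could look like with boto3 (the table, key, attribute, and Glue job names are assumptions, not taken from your setup):

```python
import boto3

dynamodb = boto3.client("dynamodb")
glue = boto3.client("glue")

EXPECTED_FILE_COUNT = 3  # however many files you are waiting for

# Atomically increment the counter and read back the new value in one call.
response = dynamodb.update_item(
    TableName="file_tracker",           # hypothetical table name
    Key={"job_id": {"S": "job-123"}},   # hypothetical partition key
    UpdateExpression="ADD files_seen :one",
    ExpressionAttributeValues={":one": {"N": "1"}},
    ReturnValues="UPDATED_NEW",
)

new_count = int(response["Attributes"]["files_seen"]["N"])
if new_count == EXPECTED_FILE_COUNT:
    # Only the invocation that sees the final count starts the job.
    glue.start_job_run(JobName="my-glue-job")  # hypothetical job name
```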
Not sure what you mean by "specific" (multiple) files.
If you are expecting specific file names (or patterns), you could simply check for all the expected files as the first instruction of your Lambda function. E.g. if you expect files A.txt, B.txt, and C.txt, test whether your S3 bucket contains those three specific files (or three *.txt files, or whatever suits your requirements). If so, keep processing; if not, return from the function. This would technically work even with concurrent invocations.
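As a rough illustration (the bucket, prefix, and file names are just examples):

```python
import boto3

s3 = boto3.client("s3")
EXPECTED = {"A.txt", "B.txt", "C.txt"}  # the specific files you expect

def all_files_present(bucket: str, prefix: str) -> bool:
    # List the keys under the prefix and check that every expected name is there.
    # (A single page covers up to 1000 keys; paginate for larger prefixes.)
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    present = {obj["Key"].rsplit("/", 1)[-1] for obj in resp.get("Contents", [])}
    return EXPECTED.issubset(present)

# In the Lambda handler: return early unless every expected file exists,
# otherwise continue and start the Glue job.
```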

Minio/S3 scenarios where files have to be moved in batch

I searched but haven't found a satisfying solution.
Minio/S3 does not have directories, only keys (with prefixes). So far so good.
Now I need to change those prefixes, not for a single file but for a whole bunch (a lot) of files, which can be really large (there is effectively no limit).
Unfortunately, these storage servers do not seem to have a concept of (and do not support):
rename file
move file
What has to be done for each file is the following (sketched below):
copy the file to the new target location
delete the file from the old source location
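So a "move" ends up being two server-side calls, roughly like this boto3 sketch (bucket and key names are placeholders; note that copy_object only supports source objects up to 5 GB, beyond which a multipart copy is needed):

```python
import boto3

s3 = boto3.client("s3")

def move_object(bucket: str, src_key: str, dst_key: str) -> None:
    # Server-side copy: the data never travels through the client, but a
    # second physical object exists until the delete completes.
    s3.copy_object(
        Bucket=bucket,
        Key=dst_key,
        CopySource={"Bucket": bucket, "Key": src_key},
    )
    s3.delete_object(Bucket=bucket, Key=src_key)

# e.g. move_object("bucketname", "temp/file.ext", "processed/jobid/file.ext")
```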
My given design looks like:
users upload files to bucketname/uploads/filename.ext
a background process takes the uploaded files, generates some more files and uploads them to bucketname/temp/filename.ext
when all processing is done, the uploaded file and the processed files are moved to bucketname/processed/jobid/new-filenames...
The path prefix is used when handling the object-created notification to differentiate whether it is an upload (start processing), temp (check whether all files are uploaded), or processed/jobid (hold the files until the user deletes them).
Imagine a task where 1000 files have to be moved to a new location (within the same bucket): copying and deleting them one by one leaves a lot of room for error, e.g. running out of storage space during the copy operation, or connection errors, with no chance of rollback. It doesn't get easier if the locations are in different buckets.
So, given this existing design and no way to rename/move a file:
Is there any chance to copy the files without creating new physical files (i.e. without duplicating the used storage space)?
Could any experienced cloud developer please give me a hint on how to do this bulk copy with rollback in error cases?
Has anyone implemented something like this with a functional rollback mechanism if, e.g., file 517 of 1000 fails? Copying them and deleting them back does not seem to be the way to go.
Currently I am using the Minio server and the Minio dotnet library, but since they are compatible with Amazon S3, this scenario could also occur on Amazon S3.

S3 vs EFS propagation delay for distributed file system?

I'm working on a project that utilizes multiple Docker containers which all need access to the same files for comparison purposes. What's important is that if a file is visible to one container, there is minimal delay before it becomes visible to the other containers.
As an example, here's the situation I'm trying to avoid:
Let's say we have two files, A and B, and two containers, 1 and 2. File A is both uploaded to the filesystem and submitted for comparison at roughly the same time. Immediately after, the same happens to file B. Soon after, file A appears visible to container 1 and file B appears visible to container 2. Due to the way the files propagated on the distributed file system, file B is not visible to container 1 and file A is not visible to container 2. Container 1 is now told to compare file A to all other files, and container 2 is told to compare file B to all other files. Because of the propagation delay, A and B are never compared to each other.
I'm trying to decide between EFS and S3 as the place to store all of these files. I'm wondering which would better fit my needs (or if there's a third option I'm unaware of).
The characteristics of the files/containers are:
- All files are small text files averaging 2 KB in size (although rarely they can be 10 KB)
- There is currently 20 MB of files in total, but I expect that to grow to 1 GB by the end of the year
- These containers are not in a swarm
- The output of each comparison is already being uploaded to S3
- Trying to make sure that every file is compared to every other file is extremely important, so the propagation delay is definitely the most important factor
(One last note: if I end up using S3, I would probably use sync to pull down all new files put into the bucket.)
Edit: To answer Kannaiyan's questions, what I'm trying to achieve is having every file compared to every other file at least once. I can't say exactly what I'm comparing, but the comparison happens by executing a closed-source Linux binary that takes in the file you want to compare and the files you want to compare it against (the distributed file system holds all the files I want to compare against). They need to be in containers for two reasons:
The binary relies heavily on a specific file system setup, and containerizing it ensures that the file system will always be correct (I know it's dumb, but again, the binary is closed source and there's no way around it).
The binary only runs on Linux, and containerizing it makes development easier in terms of testing on local machines.
Lastly, the files only accumulate over time as we get more and more submissions. Every file is only read from and never modified after being added to the system.
In the end, I decided that the approach I was originally going for was too complicated. Instead, I ended up using S3 to store all the files and DynamoDB to act as a cache for the keys of the most recently stored files. Keys are added to the DynamoDB table only after a successful upload to S3. Whenever a comparison operation runs, the containers sync the desired S3 prefix and then check DynamoDB to see if any files are missing. Due to S3's read-after-write consistency, any missing files can be pulled from S3 directly without waiting for propagation to all of S3's caches. This allows for a pretty much instantly propagating distributed file system.
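A rough boto3 sketch of that pattern (the table, attribute, and bucket names here are illustrative, not the project's actual ones):

```python
import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("recent_uploads")  # hypothetical table

def upload(bucket: str, key: str, body: bytes) -> None:
    # Register the key in DynamoDB only after the S3 upload has succeeded.
    s3.put_object(Bucket=bucket, Key=key, Body=body)
    table.put_item(Item={"key": key})

def missing_locally(local_keys: set) -> set:
    # Compare the locally synced keys against the keys DynamoDB knows about.
    # (A real implementation would paginate the scan.)
    known = {item["key"] for item in table.scan()["Items"]}
    return known - local_keys
```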

How to combine multiple S3 objects in the target S3 object w/o leaving S3

I understand that the minimum part size for a multipart upload to an S3 bucket is 5 MB.
Is there any way to have this changed on a per-bucket basis?
The reason I'm asking is that there is a set of raw objects in S3 which we want to combine into a single object in S3.
Using PUT part/copy, we are able to "glue" objects into a single one, provided that all objects except the last one are >= 5 MB. However, sometimes our raw objects are not big enough, and in that case, when we try to complete the multipart upload, we get the famous "Your proposed upload is smaller than the minimum allowed size" error from AWS S3.
Is there any other way we could combine S3 objects without downloading them first?
"However sometimes our raw objects are not big enough... "
You can have a 5 MB garbage object sitting in S3 and do the concatenation with it, where part 1 = the 5 MB garbage object and part 2 = the file you want to concatenate. Keep repeating this for each fragment, and finally use a range copy to strip out the 5 MB of garbage.
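A rough boto3 sketch of that trick (bucket and key names are placeholders, and it assumes a 5 MB filler object already sits in the bucket):

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-bucket"             # placeholder bucket name
PAD_KEY = "padding/5mb-filler"   # pre-created 5 MB garbage object
PAD_SIZE = 5 * 1024 * 1024

def multipart_copy(out_key, sources):
    """Server-side multipart copy of (key, byte_range) sources into out_key."""
    mpu = s3.create_multipart_upload(Bucket=BUCKET, Key=out_key)
    parts = []
    for n, (key, byte_range) in enumerate(sources, start=1):
        kwargs = dict(Bucket=BUCKET, Key=out_key, UploadId=mpu["UploadId"],
                      PartNumber=n, CopySource={"Bucket": BUCKET, "Key": key})
        if byte_range:
            kwargs["CopySourceRange"] = byte_range
        resp = s3.upload_part_copy(**kwargs)
        parts.append({"PartNumber": n, "ETag": resp["CopyPartResult"]["ETag"]})
    s3.complete_multipart_upload(Bucket=BUCKET, Key=out_key,
                                 UploadId=mpu["UploadId"],
                                 MultipartUpload={"Parts": parts})

def append_with_filler(small_key, padded_key, final_key):
    # Part 1 = 5 MB filler, part 2 = small object (only the last part may be < 5 MB).
    multipart_copy(padded_key, [(PAD_KEY, None), (small_key, None)])
    # Then strip the filler with a ranged copy of everything after it.
    size = s3.head_object(Bucket=BUCKET, Key=padded_key)["ContentLength"]
    multipart_copy(final_key, [(padded_key, f"bytes={PAD_SIZE}-{size - 1}")])
```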
There is no way to have the minimum part size changed.
You may want to either:
Stream them together to AWS (which does not seem like an option, otherwise you would already be doing this)
Pad the file so it fills the minimum size of 5 MB (which may or may not be feasible for you, since it will increase your bill). You have the option to use either Infrequent Access (when you access these files rarely) or Reduced Redundancy (when you can recover lost files) if you think it can be applied to these specific files, in order to reduce the impact.
Use an external service that will zip your files (or "glue" them together) and then re-upload them to S3. I don't know if such a service exists, but I am pretty sure you can implement it yourself using a Lambda function (I have even tried something like this in the past: https://github.com/gammasoft/zipper-lambda).

Why are direct writes to Amazon S3 eliminated in EMR 5.x versions?

After reading this page:
http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-differences.html
"Operational Differences and Considerations" -> "Direct writes to Amazon S3 eliminated" section.
I wonder: does this mean that writing to S3 from Hive in EMR 4.x versions will be faster than in 5.x versions?
If so, isn't that a kind of regression? Why would AWS want to eliminate this optimization?
Writing to a Hive table which is located in S3 is a very common scenario.
Can someone clear up that issue?
This optimization was originally developed by Qubole and pushed to Apache Hive.
See here.
This feature is rather dangerous because it bypasses Hive's fault-tolerance mechanism and also forces developers to use normally unnecessary intermediate tables, which in turn degrades performance and increases cost.
A very common use case is merging incremental data into a partitioned target table, described here. The query is an INSERT OVERWRITE of a table from itself; without an intermediate table (i.e. in a single query) it is rather efficient. The query can be much more complex, with many tables joined. This is what happens with direct writes enabled in this use case:
1. The partition folder is deleted before the query finishes, causing a FileNotFoundException: a mapper reading the same table that is being written fails because the partition folder was deleted before the mapper executed.
2. If the target table is initially empty, the first run succeeds because Hive knows there are no partitions and does not read the folder. The second run fails because, as in (1), the folder is deleted before the mapper finishes.
3. The known workaround has a performance impact. Loading data incrementally is quite a common use case, and the direct-write-to-S3 feature forces developers to use a temporary table in this scenario to avoid the FileNotFoundException and table corruption. As a result, the task becomes much slower and more costly than it would be with the feature disabled, writing the target table from itself.
4. After the first failure, a successful restart is impossible: the table is neither selectable nor writable, because the Hive partition exists in the metadata but the folder does not, which causes a FileNotFoundException in other queries that read this table without overwriting it.
The same thing is described, in less detail, on the Amazon page you are referring to: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-differences.html
Another possible issue is described on the Qubole page; an existing fix using prefixes is mentioned, though it does not work with the use case above, because writing new files into a folder that is being read will create a problem anyway.
Also, mappers and reducers may fail and restart, and the whole session may fail and restart; writing files directly, even with deletion of the old ones postponed, does not seem like a good idea because it increases the chance of unrecoverable failure or data corruption.
To disable direct writes, set this configuration property:
set hive.allow.move.on.s3=true; --this disables direct write
You can use this feature for small tasks and when you are not reading the same table that is being written, though for small tasks it will not give you much. This optimization is most effective when you are rewriting many partitions in a very big table and the move task at the end is extremely slow; then you may want to enable it at the risk of data corruption.