Minio/S3 scenarios where files have to be moved in batch

I searched but haven't found a satisfying solution.
Minio/S3 does not have directories, only keys (with prefixes). So far so good.
Now I need to change those prefixes - not for a single file, but for a whole bunch (a lot) of files, which can also be really large (there is effectively no limit).
Unfortunately, these storage servers don't seem to have a concept of (and do not support):
rename file
move file
What has to be done instead is, for each file (see the sketch below):
copy the file to the new target location
delete the file from the old source location
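For illustration, a minimal sketch of that two-step move, using boto3 here (the Minio SDKs expose equivalent copy/remove calls); bucket and key names are placeholders:

```python
import boto3

s3 = boto3.client("s3")  # or point it at a Minio endpoint via endpoint_url

def move_object(bucket, src_key, dst_key):
    # S3 has no rename: copy to the new key, then delete the old one.
    s3.copy_object(Bucket=bucket, Key=dst_key,
                   CopySource={"Bucket": bucket, "Key": src_key})
    s3.delete_object(Bucket=bucket, Key=src_key)  # only after the copy succeeded
```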
My current design looks like this:
users upload files to bucketname/uploads/filename.ext
a background process takes the uploaded files, generates some more files and uploads them to bucketname/temp/filename.ext
when all processing is done, the uploaded file and the processed files are moved to bucketname/processed/jobid/new-filenames...
The path prefix is used when handling the object-created notification to differentiate between uploads/ (start processing), temp/ (check whether all files have arrived) and processed/jobid/ (files are held there until the user deletes them).
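For illustration, the prefix dispatch in the notification handler could look roughly like this (assuming the standard S3 event JSON, which Minio also emits; the two handler functions are hypothetical stubs):

```python
from urllib.parse import unquote_plus

def start_processing(key):       # hypothetical hook: kick off the background job
    print("start processing", key)

def check_job_complete(key):     # hypothetical hook: are all temp files there yet?
    print("check job for", key)

def handle_object_created(event):
    for record in event.get("Records", []):
        # Bucket notifications URL-encode the object key.
        key = unquote_plus(record["s3"]["object"]["key"])
        if key.startswith("uploads/"):
            start_processing(key)
        elif key.startswith("temp/"):
            check_job_complete(key)
        elif key.startswith("processed/"):
            pass                 # final location: kept until the user deletes it
```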
Imagine a task where 1000 files have to be moved to a new location (within the same bucket): copying and deleting them one by one leaves a lot of room for errors, e.g. running out of storage space during the copy operation, or connection errors, with no chance of a rollback. It doesn't get easier if the locations are in different buckets.
So, given this old design and no chance to rename/move a file:
Is there any chance to copy the files without creating new physical files (without duplicating the used storage space)?
Could any experienced cloud developer please give me a hint on how to do this bulk copy with rollbacks in error cases?
Has anyone implemented something like that with a working rollback mechanism for when, e.g., file 517 of 1000 fails? Copying them and then deleting them back does not seem to be the way to go.
Currently I am using the Minio server and the Minio .NET library, but since they are compatible with Amazon S3, this scenario could just as well happen on Amazon S3.
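One way to approximate transactional behaviour is to copy everything first and only delete the sources once every copy has succeeded; if any copy fails, the targets created so far are removed and the sources stay untouched. A rough sketch with boto3 (the same CopyObject/DeleteObject calls exist in the Minio .NET SDK):

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def move_batch(bucket, moves):
    """moves: list of (src_key, dst_key) pairs within the same bucket."""
    copied = []
    try:
        for src, dst in moves:
            s3.copy_object(Bucket=bucket, Key=dst,
                           CopySource={"Bucket": bucket, "Key": src})
            copied.append(dst)
    except ClientError:
        # Rollback: remove the targets created so far, leave all sources intact.
        for dst in copied:
            s3.delete_object(Bucket=bucket, Key=dst)
        raise
    # All copies succeeded; deleting the sources is now safe (and retryable).
    for src, _dst in moves:
        s3.delete_object(Bucket=bucket, Key=src)
```

Note that every copy is a server-side operation (no data passes through the client), but it does create a second physical object, so storage use is temporarily doubled until the sources are deleted.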

Related

S3 vs EFS propagation delay for distributed file system?

I'm working on a project that utilizes multiple Docker containers which all need to have access to the same files for comparison purposes. What's important is that if a file becomes visible to one container, there is minimal time before it becomes visible to the other containers.
As an example, here's the situation I'm trying to avoid:
Let's say we have two files, A and B, and two containers, 1 and 2. File A is both uploaded to the file system and submitted for comparison at roughly the same time. Immediately after, the same happens to file B. Soon after, file A becomes visible to container 1 and file B becomes visible to container 2. Due to the way the files propagated on the distributed file system, file B is not visible to container 1 and file A is not visible to container 2. Container 1 is now told to compare file A to all other files, and container 2 is told to compare B to all other files. Because of the propagation delay, A and B were never compared to each other.
I'm trying to decide between EFS and S3 as the place to store all of these files. I'm wondering which would better fit my needs (or if there's a third option I'm unaware of).
The characteristics of the files/containers are:
- All files are small text files averaging 2 KB in size (although rarely they can be 10 KB)
- There's currently 20 MB of files in total, but I expect that to grow to 1 GB by the end of the year
- These containers are not in a swarm
- The output of each comparison is already being uploaded to S3
- Trying to make sure that every file is compared to every other file is extremely important, so the propagation delay is definitely the most important factor
(One last note: if I end up using S3, I would probably be using sync to pull down all new files put into the bucket)
Edit: To answer Kannaiyan's questions, what I'm trying to achieve is having every file compared to every other file at least once. I can't say exactly what I'm comparing, but the comparison happens by executing a closed-source Linux binary that takes in the file you want to compare and the files you want to compare it against (the distributed file system holds all the files I want to compare against). They need to be in containers for two reasons:
The binary relies heavily upon a specific file system setup, and containerizing it ensures that the file system will always be correct (I know it's dumb, but again, the binary is closed source and there's no way around it)
The binary only runs on Linux, and containerizing it makes development easier in terms of testing on local machines.
Lastly, the files only accumulate over time as we get more and more submissions. Every file is only read from and never modified after being added to the system.
In the end, I decided that the approach I was originally going for was too complicated. Instead I ended up using S3 to store all the files, as well as using DynamoDB to act as a cache for the keys of the most recently stored files. Keys are added to the DynamoDB table only after a successful upload to S3. Whenever a comparison operation runs, the containers sync the desired S3 directory, then check the DynamoDB table to see if any files are missing. Due to S3's read-after-write consistency, if any files are missing they can be pulled from S3 without needing to wait for propagation to all the S3 caches. This allows for a pretty much instantly propagating distributed file system.
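A rough sketch of that pattern with boto3; the bucket, the table name and its "key" partition key are hypothetical:

```python
import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("recent-file-keys")  # hypothetical table

def upload(bucket, key, local_path):
    s3.upload_file(local_path, bucket, key)
    # Register the key only after the S3 upload has succeeded.
    table.put_item(Item={"key": key})

def missing_locally(local_keys):
    """Return keys DynamoDB knows about that the local sync has not pulled yet."""
    # A real implementation would paginate the scan.
    known = {item["key"] for item in table.scan()["Items"]}
    return known - set(local_keys)
```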

S3AFileSystem - FileAlreadyExistsException when prefix is a file and part of a directory tree

We are running Apache Spark jobs with aws-java-sdk-1.7.4.jar and hadoop-aws-2.7.5.jar to write parquet files to an S3 bucket.
We have the key 's3://mybucket/d1/d2/d3/d4/d5/d6/d7' in S3 (d7 being a text file). We also have keys such as 's3://mybucket/d1/d2/d3/d4/d5/d6/d7/d8/d9/part_dt=20180615/a.parquet' (a.parquet being a file).
When we run a Spark job to write a b.parquet file under 's3://mybucket/d1/d2/d3/d4/d5/d6/d7/d8/d9/part_dt=20180616/' (i.e. we would like 's3://mybucket/d1/d2/d3/d4/d5/d6/d7/d8/d9/part_dt=20180616/b.parquet' to get created in S3), we get the error below:
org.apache.hadoop.fs.FileAlreadyExistsException: Can't make directory for path 's3a://mybucket/d1/d2/d3/d4/d5/d6/d7' since it is a file.
at org.apache.hadoop.fs.s3a.S3AFileSystem.mkdirs(S3AFileSystem.java:861)
at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1881)
As discussed in HADOOP-15542: you can't have files under files in a "normal" FS; you don't get them in the S3A connector either, at least where it does enough due diligence.
It just confuses every single tree-walking algorithm: renames, deletes, anything which scans for files. This includes the Spark partitioning logic. The new directory tree you are trying to create would probably just be invisible to callers. (You could test this by creating it, doing the PUT of that text file into place, and seeing what happens.)
We try to define what an FS should do in The Hadoop Filesystem Specification, including defining things "so obvious" that nobody bothered to write them down or write tests for, such as
Only directories can have children
All children must have a parent
Only files can have data (exception: ReiserFS)
Files are as long as they say they are (this is why S3A doesn't support client-side encryption, BTW).
Every so often we discover some new thing we forgot to consider, which "real" filesystems enforce out of the box, but which object stores don't. Then we add tests and try our best to maintain the metaphor, except when the performance impact would make it unusable. Generally, because people working with data in the Hadoop/Hive/Spark space have those same preconceptions of what a filesystem does, those ambiguities don't actually cause problems in production.
Except, of course, eventual consistency, which is why you shouldn't be writing data straight to S3 from Spark without a consistency service (S3Guard, consistent EMRFS) or a commit protocol designed for this world (S3A Committer, Databricks DBIO).

How to combine multiple S3 objects in the target S3 object w/o leaving S3

I understand that the minimum part size for a multipart upload to an S3 bucket is 5 MB.
Is there any way to have this changed on a per-bucket basis?
The reason I'm asking is that there is a list of raw objects in S3 which we want to combine into a single object in S3.
Using PUT part/copy we are able to "glue" objects into a single one, provided that all objects except the last one are >= 5 MB. However, sometimes our raw objects are not big enough, and in this case, when we try to complete the multipart upload, we get the famous "Your proposed upload is smaller than the minimum allowed size" error from AWS S3.
Any other idea how we could combine S3 objects without downloading them first?
"However sometimes our raw objects are not big enough... "
You can have a 5 MB garbage object sitting on S3 and do the concatenation with it, where part 1 = the 5 MB garbage object and part 2 = the file you want to concatenate. Keep repeating this for each fragment, and finally use a range copy to strip out the 5 MB of garbage.
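A rough sketch of that trick with boto3; the bucket, filler and key names are placeholders, and every step is a server-side copy:

```python
import boto3

s3 = boto3.client("s3")

BUCKET = "my-bucket"             # placeholder names
PAD_KEY = "padding/5mb-filler"   # a pre-created 5 MiB filler object
PAD_SIZE = 5 * 1024 * 1024

def _copy_parts(dest_key, sources):
    """Multipart-copy the given (key, optional byte range) sources into dest_key."""
    upload = s3.create_multipart_upload(Bucket=BUCKET, Key=dest_key)
    parts = []
    for number, (src_key, byte_range) in enumerate(sources, start=1):
        kwargs = dict(Bucket=BUCKET, Key=dest_key,
                      UploadId=upload["UploadId"], PartNumber=number,
                      CopySource={"Bucket": BUCKET, "Key": src_key})
        if byte_range:
            kwargs["CopySourceRange"] = byte_range
        resp = s3.upload_part_copy(**kwargs)
        parts.append({"PartNumber": number,
                      "ETag": resp["CopyPartResult"]["ETag"]})
    s3.complete_multipart_upload(Bucket=BUCKET, Key=dest_key,
                                 UploadId=upload["UploadId"],
                                 MultipartUpload={"Parts": parts})

def concat_small_objects(fragment_keys, dest_key):
    acc = PAD_KEY                       # the accumulator starts as the filler
    for i, frag in enumerate(fragment_keys):
        step = f"{dest_key}.step{i}"    # intermediate object
        # Part 1 (the accumulator) is always >= 5 MiB because it begins with the
        # filler; part 2 is the last part, so it may be any size.
        _copy_parts(step, [(acc, None), (frag, None)])
        acc = step
    # Finally strip the leading filler with a ranged single-part copy.
    total = s3.head_object(Bucket=BUCKET, Key=acc)["ContentLength"]
    _copy_parts(dest_key, [(acc, f"bytes={PAD_SIZE}-{total - 1}")])
```

The intermediate .stepN objects are left behind by this sketch and would need to be deleted afterwards.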
There is no way to have the minimum part size changed.
You may want to either:
Stream them together to AWS (which does not seem like an option, otherwise you would already be doing this)
Pad the files so they reach the minimum size of 5 MB (which may or may not be feasible for you, since this will increase your bill). You have the option of using Infrequent Access (if you access these files rarely) or Reduced Redundancy (if you can recover lost files) for these specific files in order to reduce the cost impact.
Use an external service that will zip your files (or "glue" them together) and then re-upload them to S3. I don't know if such a service exists, but I am pretty sure you could implement it yourself using a Lambda function (I have even tried something like this in the past: https://github.com/gammasoft/zipper-lambda)

OSX- Auto Delete file after x-time

Can we add metadata to unlink/remove a file automatically after some time? That is, the system automatically removes that file if it finds that particular metadata attached to it.
Note: the file can be present at any location, and the user may move it anywhere on their system, but based on that metadata the file should still get deleted (i.e. the system should call unlink/remove on it).
Is there a Cocoa/Objective-C/C++ API to set such metadata/attributes on a file?
The main point is: I am creating an application through which I provide some trial files to the user, and those files are also usable by other applications which recognise them. After the trial expires, I want to delete those files, but the user can always move my files to a different location and use them forever. How can I protect those files from permanent use?
No, there is no built-in mechanism to auto-delete a file based on some metadata.
You could add the feature yourself, with an accompanying agent that would trawl for files with the metadata and delete them when the time came.
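A minimal sketch of such an agent, assuming the third-party xattr package and a hypothetical attribute holding a Unix timestamp; you would run it periodically (e.g. from launchd):

```python
import os
import time
import xattr   # third-party package exposing getxattr/setxattr on macOS

ATTR = "com.example.expires-at"   # hypothetical attribute: expiry as a Unix timestamp

def sweep(root):
    now = time.time()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                expires = float(xattr.getxattr(path, ATTR).decode())
            except (OSError, ValueError):
                continue              # no attribute or unreadable value: leave it alone
            if expires < now:
                os.unlink(path)       # expired: remove it

if __name__ == "__main__":
    sweep(os.path.expanduser("~"))
```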
If you are doing this for good housekeeping, you can follow Petesh's answer.
If you are doing this because you really want those files gone then no. The user could move the file to a USB stick and remove it, or edit the metadata, etc.
Your earlier question "Completely restricting all types of access to a folder" seems to be addressing the same issue, and the suggestions are the same as given there - use encryption or implement your own file system.
E.g. have a special "trial file" format which is the same as the ordinary format - the one readable by other apps - but encrypted and with an expiry date included. Your app then decrypts the file, checks the date, and either does its thing or reports to the user that the file is out of date.
The system isn't unbreakable, but it's a reasonable barrier - easy for you to do, too hard for the average user to break.
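A minimal sketch of that idea, assuming the third-party cryptography package; the layout (an expiry line prepended to the payload before encryption) and the key handling are just illustrative choices:

```python
from datetime import datetime, timezone
from cryptography.fernet import Fernet

def wrap_trial_file(data: bytes, key: bytes, expires: datetime) -> bytes:
    # Prepend the expiry date, then encrypt the whole payload.
    header = expires.astimezone(timezone.utc).isoformat().encode() + b"\n"
    return Fernet(key).encrypt(header + data)

def open_trial_file(blob: bytes, key: bytes) -> bytes:
    plain = Fernet(key).decrypt(blob)          # raises InvalidToken if tampered with
    header, data = plain.split(b"\n", 1)
    if datetime.fromisoformat(header.decode()) < datetime.now(timezone.utc):
        raise PermissionError("trial period has expired")
    return data
```

The key (e.g. one generated once with Fernet.generate_key()) would have to ship inside your app, which is exactly the "reasonable barrier, not unbreakable" trade-off described above.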

How does rsync behave for concurrent file access?

I'm using rsync to run backups of my machine twice a day, and the ten to fifteen minutes it spends searching my files for modifications, slowing everything down considerably, are starting to get on my nerves.
Now I'd like to use the inotify interface of my kernel (I'm running Linux) to write a small background app that collects notifications about modified files and adds their pathnames to a list which is then processed regularly by a call to rsync.
Now, because this process by definition always works on files I've just been - and might still be - working on, I'm wondering whether I'll get loads of corrupted / partially updated files in my backup as rsync copies the files while I'm writing to them.
I couldn't find anything in the manpage and have so far been unsuccessful in googling for the answer. I could go read the source, but that might take quite a while. Anybody know how concurrent file access is handled inside rsync?
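For reference, here is a rough sketch of the watcher I have in mind, assuming the third-party inotify_simple package; the paths are placeholders, and since inotify watches are not recursive, a real version would add a watch per subdirectory:

```python
import subprocess
from inotify_simple import INotify, flags

WATCH_DIR = "/home/me/data"           # placeholder paths
DEST = "backup-host:/backups/data/"

inotify = INotify()
inotify.add_watch(WATCH_DIR, flags.CLOSE_WRITE | flags.MOVED_TO | flags.CREATE)

pending = set()
while True:
    # Block for up to 60 s, then flush whatever has accumulated.
    for event in inotify.read(timeout=60_000):
        if event.name:
            pending.add(event.name)
    if pending:
        file_list = "\n".join(sorted(pending)) + "\n"
        # --files-from=- reads the changed paths (relative to WATCH_DIR) from stdin.
        subprocess.run(["rsync", "-a", "--files-from=-", WATCH_DIR, DEST],
                       input=file_list, text=True, check=False)
        pending.clear()
```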
It's not handled at all: rsync opens the file, reads as much as it can and copies that over.
So it depends on how your applications handle this: do they rewrite the file in place (not creating a new one), or do they create a temp file and rename it once all data has been written (as they should)?
In the first case, there is little you can do: if two processes access the same data without any kind of synchronization, the result will be a mess. What you could do is defer the rsync for N minutes, assuming that the writing process will have finished before then. Reschedule the file if it changes again within this time limit.
In the second case, you must tell rsync to ignore temp files (*.tmp, *~, etc.).
It isn't handled in any way. If it is a problem, you can use e.g. LVM snapshots, and take the backup from the snapshot. That won't in itself guarantee that the files will be in a usable state, but it does guarantee that, as the name implies, it's a snapshot at a specific time.
Note that this doesn't have anything to do with whether you're letting rsync handle the change detection itself or if you use your own app. Your app, or rsync itself, just produces a list of files that have been changed, and then for each file, the rsync binary diff algorithm is run. The problem is if the file is changed while the rsync algorithm runs, not when producing the file list.