S3 vs EFS propagation delay for distributed file system? - amazon-web-services

I'm working on a project that utilizes multiple docker containers
which all need to have access to the same files for comparison purposes. What's important is that if a file appears visible to one container, then there is minimal time between when it appears visible to other containers.
As an example heres the situation I'm trying to avoid:
Let's say we have two files, A and B, and two containers, 1 and 2. File A is both uploaded to the filesystem and submitted for comparison at roughly the same time. Immediately after, the same happens to file B. Soon after File A appears visible to container 1 and file B appears visible to container 2. Due to the way the files propagated on the distributed file system, file B is not visible to container 1 and file A is not visible to container 2. Container 1 is now told to compare file A to all other files and container 2 is told to compare B to all other files. Because of the propagation delay, A and B were never compared to each other.
I'm trying to decide between EFS and S3 to use as the place to store all of these files. Im wondering which would better fit my needs (or if theres a third option I'm unaware of).
The characteristics of the files/containers are:
- All files are small text files averaging 2kb in size (although rarely they can be 10 kb)
- Theres currently 20mb of files total, but I expect there to be 1gb by the end of the year
- These containers are not in a swarm
- The output of each comparison are already being uploaded to S3
- Trying to make sure that every file is compared to every other file is extremely important, so the propagation delay is definitely the most important factor
(One last note: If I use end up using S3, I would probably be using sync to pull down all new files put into the bucket)
Edit: To answer Kannaiyan's questions, what I'm trying to achieve is having every file file compared to every other file at least once. I can't exactly say what I'm comparing, but the comparison happens by executing a closed source linux binary that takes in the file you want to compare and the files you want to compare it against (the distributed file system is holding all the files I want to compare against). They need to be in containers for two reasons:
The binary relies heavily upon a specific file system setup, and containerizing it ensures that the file system will always be correct (I know its dumb but again the binary is closed source and theres no way around it)
The binary only runs on linux, and containerizing it makes development easier in terms of testing on local machines.
Lastly the files only accumulate over time as we get more and more submissions. Every files only read from and never modified after being added to the system.

In the end, I decided that the approach I was going for originally was too complicated. Instead I ended up using S3 to store all the files, as well as using DynamoDB to act as a cache for the keys of the most recently stored files. Keys are added to the DynamoDB table only after a successful upload to S3. Whenever a comparison operation runs, the containers sync the desired S3 directory, then check the DynamoDB to see if any files are missing. Due to S3's read-after-write consistency, if any files are missing they can be pulled from S3 without needing to wait for propagation to all the S3 caches. This allows for a pretty much an instantly propagating distributed file system.

Related

one line edits to files in AWS S3

I have many very large files (> 6 GB) stored in an AWS S3 bucket that need very minor edits done to them.
I can edit these files by pulling them to a server, using sed or perl to edit the key word, and then pushing them back, but this is very time-consuming, especially for a one-word edit to a 6 or 7 GB text file.
I use a program that makes the AWS S3 like a random-access file system, https://github.com/s3fs-fuse/s3fs-fuse, but this is unusuably slow, so it isn't an option.
How can I edit these files, or use sed, via a script without the expensive and slow step of pulling from and pushing back to S3?
You can't.
The library you use certainly does it right: download the existing file, do the edit locally, then push back the results. It's always going to be slow.
With sed, it may be possible to make it faster, assuming your existing library does it in three separate steps. But you can't send the result right back and overwrite the file before you're done reading it (at least I would suggest not doing so.)
If this is a one time process, then the slowness should not be an issue. If that's something you are likely to perform all the time, then I'd suggest you use a different type of storage. This one may not be appropriate for your app.

Minio/S3 scenarios where files have to be moved in batch

I searched but haven't found a satisfying solution.
Minio/S3 does not have directories, only keys (with prefixes). So far so good.
Now I am in the need to change those prefixes. Not for a single file but for a whole bunch (a lot) files which can be really large (actually no limit).
Unfortunatly these storage servers seem not to have a concept of (and does not support):
rename file
move file
What has to be done is for each file
copy the file to the new target location
delete the file from the old source location
My given design looks like:
users upload files to bucketname/uploads/filename.ext
a background process takes the uploaded files, generates some more files and uploads them to bucketname/temp/filename.ext
when all processings are done the uploaded file and the processed files are moved to bucketname/processed/jobid/new-filenames...
The path prefix is used when handling the object created notification to differentiate if it is a upload (start processing), temp (check if all files are uploaded) and processed/jobid for holding them until the user deletes them.
Imagine a task where 1000 files have to get to a new location (within the same bucket) copy and delete them one by one has a lot of space for errors. Out of storage space during the copy operation and connection errors without any chance for rollback(s). It doesn't get easier if the locations would be different bucktes.
So, having this old design and not chance to rename/move a file:
Is there any change to copy the files without creating new physical files (without duplicating used storage space)?
Any experienced cloud developer could give me please a hint how to do this bulk copy with rollbacks in error cases?
Anyone implemented something like that with a functional rollback mechanism if e.g. file 517 of 1000 fails? Copy and delete them back seems not to be way to go.
Currently I am using Minio server and Minio dotnet library. But since they are compatible with Amazon S3 this scenario could also have happend on Amazon S3.

S3AFileSystem - FileAlreadyExistsException when prefix is a file and part of a directory tree

We are running Apache Spark jobs with aws-java-sdk-1.7.4.jar hadoop-aws-2.7.5.jar to write parquet files to an S3 bucket.
We have the key 's3://mybucket/d1/d2/d3/d4/d5/d6/d7' in s3 (d7 being a text file). We also have keys 's3://mybucket/d1/d2/d3/d4/d5/d6/d7/d8/d9/part_dt=20180615/a.parquet' (a.parquet being a file)
When we run a spark job to write b.parquet file under 's3://mybucket/d1/d2/d3/d4/d5/d6/d7/d8/d9/part_dt=20180616/' (ie would like to have 's3://mybucket/d1/d2/d3/d4/d5/d6/d7/d8/d9/part_dt=20180616/b.parquet' get created in s3) we get the below error
org.apache.hadoop.fs.FileAlreadyExistsException: Can't make directory for path 's3a://mybucket/d1/d2/d3/d4/d5/d6/d7' since it is a file.
at org.apache.hadoop.fs.s3a.S3AFileSystem.mkdirs(S3AFileSystem.java:861)
at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1881)
As discussed in HADOOP-15542. you can't have files under directories in a "normal" FS; you don't get them in the S3A connector, at least where it does enough due diligence.
It just confuses every single tree walking algorithm, renames, deletes, anything which scans for files. This will include the spark partitioning logic. That new directory tree you are trying to create would probably appear invisible to callers. (you could test this by creating it, doing the PUT of that text file into place, see what happens)
We try to define what an FS should do in The Hadoop Filesystem Specification, including defining things "so obvious" that nobody bothered to write them down or write tests for, such as
Only directories can have children
All children must have a parent
Only files can have data (exception: ReiserFS)
Files are as long as they say they are (this is why S3A doesn't support client-side encryption, BTW).
Every so often we discover some new thing we forgot to consider, which "real" filesystems enforce out the box, but which object stores don't. Then we add tests, try our best to maintain the metaphor except when the performance impact would make it unusable. Then we opt not to fix things and hope nobody notices. Generally, because people working with data in the hadoop/hive/spark space have those same preconceptions of what a filesystem does, those ambiguities don't actually cause problems in production.
Except of course eventual consistency, which is why you shouldn't be writing data straight to S3 from spark without a consistency service (S3Guard, consistent EMRFS), or a commit protocol designed for this world (S3A Committer, databricks DBIO).

Atomic delete for large amounts of files

I am trying to delete 10000+ files at once, atomically e.g. either all need to be deleted at once, or all need to stay in place.
Of course, the obvious answer is to move all the files into a temporary directory, and delete it recursively on success, but that doubles the amount of I/O required.
Compression doesn't work, because 1) I don't know which files will need to be deleted, and 2) the files need to be edited frequently.
Is there anything out there that can help reduce the I/O cost? Any platform will do.
EDIT: let's assume a power outage can happen anytime.
Kibbee is correct: you're looking for a transaction. However, you needn't depend on either databases or special file system features if you don't want to. The essence of a transaction is this:
Write out a record to a special file (often called the "log") that lists the files you are going to remove.
Once this record is safely written, make sure your application acts just as if the files have actually been removed.
Later on, start removing the files named in the transaction record.
After all files are removed, delete the transaction record.
Note that, any time after step (1), you can restart your application and it will continue removing the logically deleted files until they're finally all gone.
Please note that you shouldn't pursue this path very far: otherwise you're starting to reimplement a real transaction system. However, if you only need a very few simple transactions, the roll-your-own approach might be acceptable.
On *nix, moving files within a single filesystem is a very low cost operation, it works by making a hard link to the new name and then unlinking the original file. It doesn't even change any of the file times.
If you could move the files into a single directory, then you could rename that directory to get it out of the way as a truly atomic op, and then delete the files (and directory) later in a slower, non-atomic fashion.
Are you sure you don't just want a database? They all have transaction commit and rollback built-in.
I think what you are really looking for is the ability to have a transaction. Because the disc can only write one sector at a time, you can only delete the files one at a time. What you need is the ability to roll back the previous deletions if one of the deletes doesn't happen successfully. Tasks like this are usually reserved for databases. Whether or not your file system can do transactions depends on which file system and OS you are using. NTFS on Windows Vista supports Transactional NTFS. I'm not too sure on how it works, but it could be useful.
Also, there is something called shadow copy for Windows, which in the Linux world is called an LVM Snapshot. Basically it's a snapshot of the disc at a point in time. You could take a snapshot directly before doing the delete, and on the chance that it isn't successfully, copy the files back out of the snapshot. I've created shadow copies using WMI in VBScript, I'm sure that similar apis exist for C/C++ also.
One thing about Shadow Copy and LVM Snapsots. The work on the whole partition. So you can't take a snapshot of just a single directory. However, taking a snapshot of the whole disk takes only a couple seconds. So you would take a snapshot. Delete the files, and then if unsucessful, copy the files back out of the snapshot. This would be slow, but depending on how often you plan to roll back, it might be acceptable. The other idea would be to restore the entire snapshot. This may or may not be good as it would roll back all changes on the entire disk. Not good if your OS or other important files are located there. If this partition only contains the files you want to delete, recovering the entire snapshot may be easier and quicker.
Instead of moving the files, make symbolic links into the temporary directory. Then if things are OK, delete the files. Or, just make a list of the files somewhere and then delete them.
Couldn't you just build the list of pathnames to delete, write this list out to a file to_be_deleted.log, make sure that file has hit the disk (fsync()), then start doing the deletes. After all the deletes have been done, remove the to_be_deleted.log transaction log.
When you start up the application, it should check for the existence of to_be_deleted.log, and if it's there, replay the deletes in that file (ignoring "does not exist" errors).
The basic answer to your question is "No.". The more complex answer is that this requires support from the filesystem and very few filesystems out there have that kind of support. Apparently NT has a transactional FS which does support this. It's possible that BtrFS for Linux will support this as well.
In the absence of direct support, I think the hardlink, move, remove option is the best you're going to get.
I think the copy-and-then-delete method is pretty much the standard way to do this. Do you know for a fact that you can't tolerate the additional I/O?
I wouldn't count myself an export at file systems, but I would imagine that any implementation for performing a transaction would need to first attempt to perform all of the desired actions, and then it would need to go back and commit those actions. I.E. you can't avoid performing more I/O than doing it non-atomically.
Do you have an abstraction layer (e.g. database) for reaching the files? (If your software goes direct to the filesystem then my proposal does not apply).
If the condition is "right" to delete the files, change the state to "deleted" in your abstraction layer and begin a background job to "really" delete them from the filesystem.
Of course this proposal incurs a certain cost at opening/closing of the files but saves you some I/O on symlink creation etc.
On Windows Vista or newer, Transactional NTFS should do what you need:
HANDLE txn = CreateTransaction(NULL, 0, 0, 0, 0, NULL /* or timeout */, TEXT("Deleting stuff"));
if (txn == INVALID_HANDLE_VALUE) {
/* explode */
}
if (!DeleteFileTransacted(filename, txn)) {
RollbackTransaction(txn); // You saw nothing.
CloseHandle(txn);
die_horribly();
}
if (!CommitTransaction(txn)) {
CloseHandle(txn);
die_horribly();
}
CloseHandle(txn);

How does rsync behave for concurrent file access?

I'm using rsync to run backups of my machine twice a day and the ten to fifteen minutes when it searches my files for modifications, slowing down everything considerably, start getting on my nerves.
Now I'd like to use the inotify interface of my kernel (I'm running Linux) to write a small background app that collects notifications about modified files and adds their pathnames to a list which is then processed regularly by a call to rsync.
Now, because this process by definition always works on files I've just been - and might still be - working on, I'm wondering whether I'll get loads of corrupted / partially updated files in my backup as rsync copies the files while I'm writing to them.
I couldn't find anyhing in the manpage and was yet unsuccessful in googling for the answer. I could go read the source, but that might take quite a while. Anybody know how concurrent file access is handled inside rsync?
It's not handled at all: rsync opens the file, reads as much as it can and copies that over.
So it depends how your applications handle this: Do they rewrite the file (not creating a new one) or do they create a temp file and rename that when all data has been written (as they should).
In the first case, there is little you can do: If two processes access the same data without any kind of synchronization, the result will be a mess. What you could do is defer the rsync for N minutes, assuming that the writing process will eventually finish before that. Reschedule the file if it is changes again within this time limit.
In the second case, you must tell rsync to ignore temp files (*.tmp, *~, etc).
It isn't handled in any way. If it is a problem, you can use e.g. LVM snapshots, and take the backup from the snapshot. That won't in itself guarantee that the files will be in a usable state, but it does guarantee that, as the name implies, it's a snapshot at a specific time.
Note that this doesn't have anything to do with whether you're letting rsync handle the change detection itself or if you use your own app. Your app, or rsync itself, just produces a list of files that have been changed, and then for each file, the rsync binary diff algorithm is run. The problem is if the file is changed while the rsync algorithm runs, not when producing the file list.