I have many very large files (> 6 GB) stored in an AWS S3 bucket that need very minor edits done to them.
I can edit these files by pulling them to a server, using sed or perl to edit the key word, and then pushing them back, but this is very time-consuming, especially for a one-word edit to a 6 or 7 GB text file.
I use a program that presents AWS S3 as a random-access file system, https://github.com/s3fs-fuse/s3fs-fuse, but it is unusably slow, so it isn't an option.
How can I edit these files, or use sed, via a script without the expensive and slow step of pulling from and pushing back to S3?
You can't.
The library you use certainly does it right: download the existing file, do the edit locally, then push back the results. It's always going to be slow.
With sed, you may be able to make it faster, assuming your existing library does the three steps (download, edit, upload) separately rather than streaming them. But you can't send the result straight back and overwrite the file before you're done reading it (at least I would suggest not doing so).
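For illustration, here is a minimal sketch of that download-edit-upload flow using boto3, doing the one-word replacement in Python rather than sed and streaming the object line by line so the whole file never sits in memory at once; the bucket, key, and replaced word are placeholders.

import boto3

s3 = boto3.client("s3")
bucket, key = "my-bucket", "big-file.txt"   # placeholder names

# Stream the object down, apply the one-word edit, and spool it to local disk
body = s3.get_object(Bucket=bucket, Key=key)["Body"]
with open("/tmp/edited.txt", "wb") as out:
    for line in body.iter_lines():
        out.write(line.replace(b"old-word", b"new-word") + b"\n")

# Only overwrite the object once the read has finished
s3.upload_file("/tmp/edited.txt", bucket, key)

It still transfers the full file in both directions; the streaming just avoids holding 6 GB in memory or writing an unedited intermediate copy.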
If this is a one-time process, then the slowness should not be an issue. If it's something you are likely to perform all the time, then I'd suggest you use a different type of storage; this one may not be appropriate for your app.
I am creating a very big file that cannot fit in memory. So I have created a bunch of small files in S3 and am writing a script that reads these files and merges them. I am using AWS Data Wrangler to do this.
My code is as follows:
import logging
import awswrangler as wr

logger = logging.getLogger(__name__)

try:
    # Read the small parquet files in memory-sized chunks
    dfs = wr.s3.read_parquet(path=input_folder, path_suffix=['.parquet'], chunked=True, use_threads=True)
    for df in dfs:
        # Append each chunk to the target dataset
        path = wr.s3.to_parquet(df=df, dataset=True, path=target_path, mode="append")
        logger.info(path)
except Exception as e:
    logger.error(e, exc_info=True)
    logger.info(e)
The problem is that wr.s3.to_parquet creates a lot of files instead of writing to one file. Also, I can't remove chunked=True, because otherwise my program fails with an OOM error.
How do I make this write a single file in S3?
AWS Data Wrangler is writing multiple files because you have specified dataset=True. Removing this flag, or switching it to False, should do the trick, as long as you specify a full object path.
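As a sketch of what that suggestion looks like (the bucket path is a placeholder and df is a single in-memory DataFrame), calling wr.s3.to_parquet with a full object path and without dataset=True writes exactly one file:

import awswrangler as wr
import pandas as pd

df = pd.DataFrame({"col": [1, 2, 3]})   # stand-in for the data being written

# Without dataset=True, the path names the single output object
wr.s3.to_parquet(df=df, path="s3://my-bucket/output/merged.parquet")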
I don't believe this is possible. @Abdel Jaidi's suggestion won't work, as mode="append" requires dataset to be True or it will throw an error. I believe that in this case, append has more to do with "appending" the data in Athena or Glue by adding new files to the same folder.
I also don't think this is even possible for parquet in general. As per this SO post, it's not possible in a local folder, let alone S3. To add to this, parquet is compressed, and I don't think it would be easy to add a line to a compressed file without loading it all into memory.
I think the only solution is to get a beefy EC2 instance that can handle this.
I'm facing a similar issue, and I think I'm going to just loop over all the small files and create bigger ones. For example, you could append several dataframes together and then rewrite those, but you won't be able to get back to one parquet file unless you get a computer with enough RAM.
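A rough sketch of that loop-over-and-rewrite idea with AWS Data Wrangler, where the paths and batch size are placeholders you would tune to your memory budget; the result is a handful of larger parquet files rather than one single file:

import awswrangler as wr

input_folder = "s3://my-bucket/small-files/"   # placeholder paths
target_path = "s3://my-bucket/combined/"
batch_size = 500                               # how many small files to merge per output

keys = wr.s3.list_objects(input_folder, suffix=".parquet")
for i in range(0, len(keys), batch_size):
    # Read one batch of small files into memory and write it back as one bigger file
    df = wr.s3.read_parquet(path=keys[i:i + batch_size])
    wr.s3.to_parquet(df=df, path=f"{target_path}part-{i // batch_size:05d}.parquet")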
I am facing an issue in Spark while reading data: the input partitions are huge and I am getting a 503 Slow Down error.
After checking with the AWS team, they mentioned this happens while reading files because the request rate is too high.
One of the solutions they suggested is to combine the small files into bigger ones to reduce the number of files. Does anyone know how to merge small files in S3 into a bigger file? Is there any utility available for doing this?
Please note that I am not referring to small part files under one partition. Say I have a level 1 partition on Created_date and a level 2 partition on VIN. I have one part file under each VIN, but there are too many VIN partitions. So I am exploring whether we can merge these VIN part files in S3 into a generic CSV, so we can handle this S3 slow-down issue.
Your answers are much appreciated!
Thanks and regards,
Raghav Chandra Shetty
First off, I'm not familiar with Spark.
Combining files in S3 is not possible. S3 is just a place to put your files as is. I think what AWS support is telling you is that you can reduce the number of calls you make by simply having fewer files. So it's up to you, BEFORE you upload your files to S3, to make them bigger (combine them), either by placing more data into each file or by creating a tarball/zip.
You can get similar if not better speeds, plus save on your request limit, by downloading one 100 MB file instead of one hundred 1 MB files. You can then also start taking advantage of the Multipart Upload/Download feature of S3.
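As one way to do that combining before upload, here is a sketch that tars the small files into a single archive and pushes it as one object; the local directory, file pattern, and bucket name are hypothetical:

import tarfile
import boto3
from pathlib import Path

local_dir = Path("spark-output")            # hypothetical directory of small files
archive = "combined.tar.gz"

# Pack the many small files into one compressed archive
with tarfile.open(archive, "w:gz") as tar:
    for f in sorted(local_dir.glob("*.csv")):
        tar.add(f, arcname=f.name)

# One PUT request instead of thousands
boto3.client("s3").upload_file(archive, "my-bucket", f"merged/{archive}")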
I'm working on a project that utilizes multiple Docker containers, which all need to have access to the same files for comparison purposes. What's important is that if a file appears visible to one container, there is then minimal delay before it appears visible to the other containers.
As an example, here's the situation I'm trying to avoid:
Let's say we have two files, A and B, and two containers, 1 and 2. File A is both uploaded to the filesystem and submitted for comparison at roughly the same time. Immediately after, the same happens to file B. Soon after, file A appears visible to container 1 and file B appears visible to container 2. Due to the way the files propagated on the distributed file system, file B is not visible to container 1 and file A is not visible to container 2. Container 1 is now told to compare file A to all other files, and container 2 is told to compare B to all other files. Because of the propagation delay, A and B were never compared to each other.
I'm trying to decide between EFS and S3 as the place to store all of these files. I'm wondering which would better fit my needs (or if there's a third option I'm unaware of).
The characteristics of the files/containers are:
- All files are small text files averaging 2 KB in size (although rarely they can be 10 KB)
- There's currently 20 MB of files total, but I expect there to be 1 GB by the end of the year
- These containers are not in a swarm
- The output of each comparison is already being uploaded to S3
- Trying to make sure that every file is compared to every other file is extremely important, so the propagation delay is definitely the most important factor
(One last note: if I end up using S3, I would probably be using sync to pull down all new files put into the bucket)
Edit: To answer Kannaiyan's questions, what I'm trying to achieve is having every file compared to every other file at least once. I can't exactly say what I'm comparing, but the comparison happens by executing a closed-source Linux binary that takes in the file you want to compare and the files you want to compare it against (the distributed file system is holding all the files I want to compare against). They need to be in containers for two reasons:
The binary relies heavily upon a specific file system setup, and containerizing it ensures that the file system will always be correct (I know it's dumb, but again, the binary is closed source and there's no way around it).
The binary only runs on Linux, and containerizing it makes development easier in terms of testing on local machines.
Lastly, the files only accumulate over time as we get more and more submissions. Every file is only read from and never modified after being added to the system.
In the end, I decided that the approach I was going for originally was too complicated. Instead, I ended up using S3 to store all the files, as well as using DynamoDB to act as a cache for the keys of the most recently stored files. Keys are added to the DynamoDB table only after a successful upload to S3. Whenever a comparison operation runs, the containers sync the desired S3 directory, then check DynamoDB to see if any files are missing. Due to S3's read-after-write consistency, any missing files can be pulled from S3 without needing to wait for propagation to all the S3 caches. This allows for pretty much an instantly propagating distributed file system.
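A rough sketch of that check-and-backfill step, with hypothetical bucket, table, attribute, and directory names, and with DynamoDB scan pagination and error handling left out:

import boto3
from pathlib import Path

BUCKET, TABLE, LOCAL = "comparison-files", "recent-uploads", Path("/data/files")

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table(TABLE)

def fetch_missing_files():
    local_keys = {p.name for p in LOCAL.iterdir()}
    # The table holds the keys of recently uploaded objects
    for item in table.scan()["Items"]:
        key = item["filename"]
        if key not in local_keys:
            # Read-after-write consistency: a just-written object can be fetched by key
            s3.download_file(BUCKET, key, str(LOCAL / key))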
I have an S3 bucket with tens of millions of relatively small JSON files, each less than 10 KB.
To analyze them, I would like to merge them into a small number of files, each having one JSON document per line (or some other separator) and several thousand such lines.
This would allow me to more easily (and performantly) use all kinds of big data tools out there.
Now, it is clear to me this cannot be done with one command or function call; rather, a distributed solution is needed because of the number of files involved.
The question is whether there is something ready and packaged, or must I roll my own solution.
I don't know of anything out there that can do this out of the box, but you can pretty easily do it yourself. The solution also depends a lot on how fast you need to get this done.
Two suggestions:
1) List all the files, split the list, download sections, merge, and re-upload (a sketch of this approach follows below).
2) List all the files, then go through them one at a time, reading/downloading each and writing it to a Kinesis stream. Configure Kinesis to dump the files to S3 via Kinesis Firehose.
In both scenarios the tricky bit is going to be handling failures and ensuring you don't get the data multiple times.
For completeness, if the files were larger (> 5 MB) you could also leverage http://docs.aws.amazon.com/AmazonS3/latest/API/mpUploadUploadPartCopy.html, which would allow you to merge files in S3 directly without having to download them.
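Here is a single-threaded sketch of suggestion 1), with placeholder bucket and prefix names and an arbitrary lines-per-output figure; for tens of millions of files you would split the key listing and run many such workers in parallel:

import boto3

s3 = boto3.client("s3")
bucket, prefix = "my-bucket", "small-json/"    # placeholders
lines_per_output, buffer, part = 5000, [], 0

def flush(buffer, part):
    # Write the accumulated documents as one newline-delimited JSON object
    s3.put_object(Bucket=bucket, Key=f"merged/part-{part:05d}.jsonl",
                  Body="\n".join(buffer).encode())

for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        doc = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
        buffer.append(doc.decode().strip())    # one JSON document per line
        if len(buffer) >= lines_per_output:
            flush(buffer, part)
            buffer, part = [], part + 1

if buffer:
    flush(buffer, part)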
Assuming each JSON file is one line only, I would do:
cat * >> bigfile
This will concatenate all files in a directory into the new file bigfile.
You can now read bigfile one line at a time, JSON-decode the line, and do something interesting with it.
If your JSON files are formatted for readability, then you will first need to combine all the lines in each file into one line.
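As a minimal sketch, the read-and-decode loop over the merged file could look like this:

import json

with open("bigfile") as f:
    for line in f:
        record = json.loads(line)   # each line is one JSON document
        print(record)               # do something interesting with it here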
I'm using rsync to run backups of my machine twice a day, and the ten to fifteen minutes it spends searching my files for modifications, slowing everything down considerably, are starting to get on my nerves.
Now I'd like to use the inotify interface of my kernel (I'm running Linux) to write a small background app that collects notifications about modified files and adds their pathnames to a list which is then processed regularly by a call to rsync.
Now, because this process by definition always works on files I've just been - and might still be - working on, I'm wondering whether I'll get loads of corrupted / partially updated files in my backup as rsync copies the files while I'm writing to them.
I couldn't find anything in the manpage and have so far been unsuccessful in googling for the answer. I could go read the source, but that might take quite a while. Does anybody know how concurrent file access is handled inside rsync?
It's not handled at all: rsync opens the file, reads as much as it can and copies that over.
So it depends on how your applications handle this: do they rewrite the file in place (not creating a new one), or do they create a temp file and rename it once all data has been written (as they should)?
In the first case, there is little you can do: if two processes access the same data without any kind of synchronization, the result will be a mess. What you could do is defer the rsync for N minutes, assuming that the writing process will finish before then. Reschedule the file if it changes again within this time limit.
In the second case, you must tell rsync to ignore temp files (*.tmp, *~, etc.).
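A rough sketch of that defer-for-N-minutes idea, assuming you already have an inotify handler that calls note_change() for each modified path; the rsync destination, quiet period, and list-file location are placeholders:

import subprocess
import time

QUIET_SECONDS = 300     # only back up files untouched for 5 minutes
pending = {}            # path -> time of last inotify event

def note_change(path):
    # Call this from your inotify event handler
    pending[path] = time.time()

def flush(dest="backup-host:/backups/"):
    now = time.time()
    ready = [p for p, t in pending.items() if now - t >= QUIET_SECONDS]
    if not ready:
        return
    with open("/tmp/rsync-list", "w") as f:
        f.write("\n".join(ready) + "\n")
    # rsync only the settled files; --files-from reads the list, paths relative to "/"
    subprocess.run(["rsync", "-a", "--files-from=/tmp/rsync-list", "/", dest], check=True)
    for p in ready:
        pending.pop(p, None)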
It isn't handled in any way. If it is a problem, you can use e.g. LVM snapshots, and take the backup from the snapshot. That won't in itself guarantee that the files will be in a usable state, but it does guarantee that, as the name implies, it's a snapshot at a specific time.
Note that this doesn't have anything to do with whether you're letting rsync handle the change detection itself or if you use your own app. Your app, or rsync itself, just produces a list of files that have been changed, and then for each file, the rsync binary diff algorithm is run. The problem is if the file is changed while the rsync algorithm runs, not when producing the file list.