I am using s3zipper along with PHP to stream S3 files as a zip. However, there is one issue: we have more than 1,000 files to download (anywhere from roughly 2K to 10K). When we send a request to s3zipper for, say, 1,500 files, we only get 1,000 files back in the zip.
According to the AWS docs, there is a 1,000-key limitation:
The S3 API version 2 implementation of the GET operation returns some or all (up to 1,000) of the objects in a bucket.
So if we want to get more than that, we have to use the marker parameter. But in s3zipper.go the call aws_bucket.GetReader(file.S3Path) reads a file and adds it to the zip, and I am not sure how I can use a marker in this case.
I am curious how we can get around this limitation. I am a newbie to the Go language; any help in this regard will be highly appreciated.
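For what it's worth, here is a minimal Go sketch of that marker-based paging, assuming the goamz S3 package that s3zipper builds on (the import paths are for the AdRoll fork; adjust to whichever fork your build uses, and the bucket name and prefix are placeholders). Each List call returns at most 1,000 keys; you keep calling it with the last key of the previous page as the marker until IsTruncated is false, and then stream each collected key through GetReader into the zip just as s3zipper already does.

package main

import (
    "fmt"
    "log"

    "github.com/AdRoll/goamz/aws"
    "github.com/AdRoll/goamz/s3"
)

// listAllKeys pages through a bucket with the marker parameter so that
// more than 1,000 keys can be collected in total.
func listAllKeys(bucket *s3.Bucket, prefix string) ([]string, error) {
    var keys []string
    marker := ""
    for {
        // Ask for up to 1,000 keys starting after the marker.
        resp, err := bucket.List(prefix, "", marker, 1000)
        if err != nil {
            return nil, err
        }
        for _, k := range resp.Contents {
            keys = append(keys, k.Key)
        }
        if !resp.IsTruncated || len(resp.Contents) == 0 {
            break // no more pages
        }
        // Continue the listing from the last key of this page.
        marker = resp.Contents[len(resp.Contents)-1].Key
    }
    return keys, nil
}

func main() {
    auth, err := aws.EnvAuth() // AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY
    if err != nil {
        log.Fatal(err)
    }
    bucket := s3.New(auth, aws.USEast).Bucket("my-bucket") // placeholder bucket
    keys, err := listAllKeys(bucket, "exports/")           // placeholder prefix
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println("found", len(keys), "keys")
    // Each key can then be streamed into the zip the same way s3zipper does:
    // rc, err := bucket.GetReader(key); io.Copy(zipEntry, rc); rc.Close()
}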
Related
I have a single bucket with a large number of very small text files (between 500 bytes and 1.2 KB). This bucket currently contains over 1.7 million files and will keep growing.
The way I add data to this bucket is by generating batches of files (on the order of 50,000 files) and transferring those files into the bucket.
Now the problem is this: if I transfer the files one by one in a loop, it takes an unbearably long time. So if all the files are in a directory origin_directory, I would do
aws s3 cp origin_directory/filename_i s3://my_bucket/filename_i
I would run this command 50,000 times.
Right now I'm testing this on a set of about 280K files. Doing it this way would take approximately 68 hours according to my calculations. However, I found out that I can sync:
aws s3 sync origin_directory s3://my_bucket/
Now this works much, much faster (it will take about 5 hours, according to my calculations). However, the sync needs to figure out what to copy (files present in the directory and not present in the bucket). Since the files in the bucket will be ever increasing, I'm thinking that this will take longer and longer as time moves on.
However, since I delete the information after every sync, I know that the sync operation needs to transfer all files in that directory.
So my question is, is there a way to start a "batch copy" similar to the sync, without actually doing the sync?
You can use:
aws s3 cp --recursive origin_directory/ s3://my_bucket/
This is the same as a sync, but it will not check whether the files already exist.
Also, see Use of Exclude and Include Filters to learn how to specify wildcards (eg all *.txt files).
When copying a large number of files using aws s3 sync or aws s3 cp --recursive, the AWS CLI will parallelize the copying, making it much faster. You can also play with the AWS CLI S3 Configuration to potentially optimize it for your typical types of files (eg copy more files simultaneously).
Try using https://github.com/mondain/jets3t
It performs the same function but works in parallel, so it will complete the job much faster.
I have 2 million zipped HTML files (100-150KB) being added each day that I need to store for a long time.
Hot data (70-150 million files) is accessed semi-regularly; anything older than that is barely ever accessed.
This means each day I'm storing an additional 200-300GB worth of files.
Now, Standard storage costs $0.023 per GB, while Glacier costs $0.004 per GB.
While Glacier is cheap, the problem with it is that it has additional costs, so it would be a bad idea to dump 2 million files into Glacier:
PUT requests to Glacier $0.05 per 1,000 requests
Lifecycle Transition Requests into Glacier $0.05 per 1,000 requests
Is there a way of gluing the files together, but keeping them accessible individually?
An important point: if you need to provide quick access to these files, note that Glacier can take up to 12 hours to give you access to a file. So the best you can do is to use S3 Standard-Infrequent Access ($0.0125 per GB, with millisecond access) instead of S3 Standard, and maybe Glacier for data that really isn't used. But it still depends on how fast you need that data.
Given that, I'd suggest the following:
as HTML (text) files have a good level of compression, you can compress historical data into big zip files (daily, weekly, or monthly), since together they can achieve even better compression;
make an index file or database that records which archive each HTML file is stored in;
read only the desired HTML files from the archives without unpacking the whole zip file. See the example in Python of how to implement that.
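The linked example is Python, but the same idea is only a few lines in Go as well. Here is a minimal sketch (the archive and member names are made up): archive/zip reads the central directory at the end of the file and decompresses only the member you ask for.

package main

import (
    "archive/zip"
    "fmt"
    "io"
    "log"
    "os"
)

// extractOne pulls a single member out of a zip archive without
// unpacking anything else.
func extractOne(zipPath, member, dest string) error {
    r, err := zip.OpenReader(zipPath)
    if err != nil {
        return err
    }
    defer r.Close()

    for _, f := range r.File {
        if f.Name != member {
            continue
        }
        rc, err := f.Open() // decompresses only this entry
        if err != nil {
            return err
        }
        defer rc.Close()

        out, err := os.Create(dest)
        if err != nil {
            return err
        }
        defer out.Close()

        _, err = io.Copy(out, rc)
        return err
    }
    return fmt.Errorf("%s not found in %s", member, zipPath)
}

func main() {
    // Hypothetical archive and member names, purely for illustration.
    if err := extractOne("2019-01-07.zip", "pages/article-42.html", "article-42.html"); err != nil {
        log.Fatal(err)
    }
}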
Glacier is extremely cost-sensitive when it comes to the number of files. The best method would be to create a Lambda function that handles zip/unzip operations for you.
Consider this approach:
A Lambda function creates archive_date_hour.zip files from that day's 2 million files, batched by hour; this solves the "per object" cost problem by creating 24 giant archive files per day.
Set a lifecycle policy on the S3 bucket to move objects older than 1 day to Glacier (a sketch of such a rule follows after this list).
Use an unzipping Lambda function to fetch and extract potentially hot items from the Glacier-stored zip files.
Keep the main S3 bucket for hot, frequently accessed files, as a working directory for the zip/unzip operations, and for collecting new files daily.
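As an illustration of the lifecycle step above, here is a hedged sketch using aws-sdk-go; the bucket name and prefix are placeholders, and the same rule can just as easily be set in the console or via the CLI.

package main

import (
    "log"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/s3"
)

func main() {
    sess := session.Must(session.NewSession())
    svc := s3.New(sess)

    // Transition objects under archives/ to Glacier one day after creation.
    _, err := svc.PutBucketLifecycleConfiguration(&s3.PutBucketLifecycleConfigurationInput{
        Bucket: aws.String("my-archive-bucket"), // placeholder
        LifecycleConfiguration: &s3.BucketLifecycleConfiguration{
            Rules: []*s3.LifecycleRule{
                {
                    ID:     aws.String("hourly-zips-to-glacier"),
                    Status: aws.String("Enabled"),
                    Filter: &s3.LifecycleRuleFilter{Prefix: aws.String("archives/")},
                    Transitions: []*s3.Transition{
                        {
                            Days:         aws.Int64(1),
                            StorageClass: aws.String(s3.TransitionStorageClassGlacier),
                        },
                    },
                },
            },
        },
    })
    if err != nil {
        log.Fatal(err)
    }
    log.Println("lifecycle rule applied")
}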
Your files are just too small. You will probably need to combine them, for example in an ETL pipeline such as AWS Glue. You can also use the Range header (e.g. Range: bytes=1000-2000) to download part of an object from S3.
If you do that, you'll need to figure out the best way to track the byte ranges, such as recording the range for each file after combining them, and changing the clients to use the ranges as well.
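To make the ranged-read idea concrete, here is a small sketch with aws-sdk-go; the bucket, key, and byte offsets are made up, and in practice the client would look the offsets up in whatever index you keep after combining the files.

package main

import (
    "fmt"
    "io"
    "log"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/s3"
)

func main() {
    sess := session.Must(session.NewSession())
    svc := s3.New(sess)

    // Fetch only bytes 1000-2000 of a combined object.
    out, err := svc.GetObject(&s3.GetObjectInput{
        Bucket: aws.String("my-bucket"),          // placeholder
        Key:    aws.String("combined/batch-001"), // placeholder
        Range:  aws.String("bytes=1000-2000"),
    })
    if err != nil {
        log.Fatal(err)
    }
    defer out.Body.Close()

    data, err := io.ReadAll(out.Body)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("read %d bytes of the embedded file\n", len(data))
}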
The right approach, though, depends on how this data is accessed and on figuring out the access patterns. If somebody who looks at TinyFileA also looks at TinyFileB, you could combine them and send both along with other files they are likely to use. I would figure out logical groupings of files that make sense to consumers and reduce the number of requests they need, without sending too much irrelevant data.
Lately, we've noticed that our AWS bill has been higher than usual. It's due to adding an aws s3 sync task to our regular build process. The build process generates around 3,000 files. After the build, we run aws s3 sync to upload them en masse to a bucket. The problem is that this is monetarily expensive: each upload run is costing us ~$2 (we think), and this adds up to a monthly bill that raises eyebrows.
All but maybe 1 or 2 of those files actually change from build to build. The rest are always the same. Yet aws s3 sync sees that they all changed and uploads the whole lot.
The documentation says that aws s3 sync compares the file's last modified date and byte size to determine if it should upload. The build server creates all those files brand-new every time, so the last modified date is always changed.
What I'd like to do is get it to compute a checksum or a hash on each file and then use that hash to compare the files. Amazon S3 already has the ETag field, which can be an MD5 hash of the file. But the aws s3 sync command doesn't use the ETag.
Is there a way to use etag? Is there some other way to do this?
The end result is that I'd only like to upload the 1 or 2 files that are actually different (and save a tremendous cost).
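For reference, the comparison described above can be scripted directly. Below is a rough Go sketch using aws-sdk-go (bucket, key, and local path are hypothetical); it assumes single-part, non-KMS-encrypted uploads, where the ETag is normally the plain MD5 of the object, so treat it as a heuristic rather than a guarantee.

package main

import (
    "crypto/md5"
    "encoding/hex"
    "fmt"
    "io"
    "log"
    "os"
    "strings"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/awserr"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/s3"
)

// localMD5 computes the hex MD5 of a local file.
func localMD5(path string) (string, error) {
    f, err := os.Open(path)
    if err != nil {
        return "", err
    }
    defer f.Close()
    h := md5.New()
    if _, err := io.Copy(h, f); err != nil {
        return "", err
    }
    return hex.EncodeToString(h.Sum(nil)), nil
}

// needsUpload reports whether the local file differs from the object's ETag.
func needsUpload(svc *s3.S3, bucket, key, path string) (bool, error) {
    sum, err := localMD5(path)
    if err != nil {
        return false, err
    }
    head, err := svc.HeadObject(&s3.HeadObjectInput{
        Bucket: aws.String(bucket),
        Key:    aws.String(key),
    })
    if err != nil {
        if aerr, ok := err.(awserr.RequestFailure); ok && aerr.StatusCode() == 404 {
            return true, nil // object does not exist yet
        }
        return false, err
    }
    remote := strings.Trim(aws.StringValue(head.ETag), `"`)
    return remote != sum, nil
}

func main() {
    sess := session.Must(session.NewSession())
    svc := s3.New(sess)
    // Placeholder bucket, key, and local path for illustration.
    changed, err := needsUpload(svc, "my-deploy-bucket", "assets/app.js", "build/assets/app.js")
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println("upload needed:", changed)
}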
The aws s3 sync command has a --size-only parameter.
From aws s3 sync options:
--size-only (boolean) Makes the size of each key the only criteria used to decide whether to sync from source to destination.
This will likely avoid copying all files if they are updated with the same content.
As an alternative to s3 sync or cp, you could use s5cmd:
https://github.com/peak/s5cmd
It can sync files based on size and date if they differ, and it also reaches speeds of up to 4.6 GB/s.
Example of the sync command:
AWS_REGION=eu-west-1 /usr/local/bin/s5cmd -stats cp -u -s --parents s3://bucket/folder/* /home/ubuntu
S3 charges $0.005 per 1,000 PUT requests (doc), so it's extremely unlikely that uploading 3,000 files is costing you $2 per build. Maybe $2 per day if you're running 50-100 builds a day, but that's still not much.
If you really are paying that much per build, you should enable CloudTrail events and see what is actually writing that much (for that matter, maybe you've created some sort of recursive CloudTrail event log).
The end result is that I'd only like to upload the 1 or 2 files that are actually different
Are these files the artifacts produced by your build? If yes, why not just add a build step that copies them explicitly?
The issue I had was using the wildcard * in the --include option. Using one wildcard was fine, but when I added a second *, such as /log., it looked like sync tried to download everything to compare, which took a lot of CPU and network bandwidth.
I have some files that are being uploaded to S3 and processed for some Redshift task. After that task is complete these files need to be merged. Currently I am deleting these files and uploading merged files again.
This eats up a lot of bandwidth. Is there any way the files can be merged directly on S3?
I am using Apache Camel for routing.
S3 allows you to use an S3 object URI as the source for a copy operation. Combined with S3's multipart upload API, you can supply several S3 object URIs as the source keys for a multipart upload.
However, the devil is in the details. S3's multipart upload API has a minimum part size of 5 MB. Thus, if any file in the series of files under concatenation is < 5 MB, it will fail.
However, you can work around this by exploiting the loophole that allows the final upload piece to be < 5 MB (allowed because this happens in the real world when uploading remainder pieces).
My production code does this by:
Interrogating the manifest of files to be uploaded
If the first part is under 5 MB, downloading pieces* and buffering to disk until 5 MB is buffered.
Append parts sequentially until file concatenation complete
If a non-terminus file is < 5MB, append it, then finish the upload and create a new upload and continue.
Finally, there is a bug in the S3 API: the ETag (which is really an MD5 file checksum on S3) is not properly recalculated at the completion of a multipart upload. To fix this, copy the file on completion. If you use a temp location during concatenation, this will be resolved by the final copy operation.
* Note that you can download a byte range of a file. This way, if part 1 is 10K and part 2 is 5 GB, you only need to read in 5110K to meet the 5 MB size needed to continue.
** You could also have a 5 MB block of zeros on S3 and use it as your default starting piece. Then, when the upload is complete, do a file copy using a byte range of 5MB+1 to EOF-1.
P.S. When I have time to make a Gist of this code I'll post the link here.
You can use Multipart Upload with Copy to merge objects on S3 without downloading and uploading them again.
You can find some examples in Java, .NET or with the REST API here.
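For a Go flavour of the same technique, here is a rough sketch using aws-sdk-go's UploadPartCopy (bucket and key names are placeholders); it assumes every source object except possibly the last is at least 5 MB, per the limitation discussed above.

package main

import (
    "fmt"
    "log"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/s3"
)

// concatOnS3 merges the given source keys into destKey entirely server-side
// by copying each source object in as one part of a multipart upload.
func concatOnS3(svc *s3.S3, bucket, destKey string, sourceKeys []string) error {
    create, err := svc.CreateMultipartUpload(&s3.CreateMultipartUploadInput{
        Bucket: aws.String(bucket),
        Key:    aws.String(destKey),
    })
    if err != nil {
        return err
    }

    var parts []*s3.CompletedPart
    for i, src := range sourceKeys {
        partNum := int64(i + 1)
        out, err := svc.UploadPartCopy(&s3.UploadPartCopyInput{
            Bucket:     aws.String(bucket),
            Key:        aws.String(destKey),
            UploadId:   create.UploadId,
            PartNumber: aws.Int64(partNum),
            CopySource: aws.String(fmt.Sprintf("%s/%s", bucket, src)),
        })
        if err != nil {
            // Abort so the incomplete upload does not keep accruing storage.
            svc.AbortMultipartUpload(&s3.AbortMultipartUploadInput{
                Bucket: aws.String(bucket), Key: aws.String(destKey), UploadId: create.UploadId,
            })
            return err
        }
        parts = append(parts, &s3.CompletedPart{
            ETag:       out.CopyPartResult.ETag,
            PartNumber: aws.Int64(partNum),
        })
    }

    _, err = svc.CompleteMultipartUpload(&s3.CompleteMultipartUploadInput{
        Bucket:          aws.String(bucket),
        Key:             aws.String(destKey),
        UploadId:        create.UploadId,
        MultipartUpload: &s3.CompletedMultipartUpload{Parts: parts},
    })
    return err
}

func main() {
    sess := session.Must(session.NewSession())
    svc := s3.New(sess)
    // Placeholder bucket and keys for illustration.
    err := concatOnS3(svc, "my-bucket", "merged/2015-03-01.dat",
        []string{"parts/a.dat", "parts/b.dat", "parts/c.dat"})
    if err != nil {
        log.Fatal(err)
    }
}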
So I know this is a common question but there just doesn't seem to be any good answers for it.
I have a bucket with gobs of files in it (I have no clue how many). They are all around 2 KB apiece.
1) How do I figure out how many of these files I have WITHOUT listing them?
I've used the s3cmd.rb, aws/s3, and jets3t stuff, and the best I can find is a command to count the first 1,000 records (really performing GETs on them).
I've been using jets3t's applet as well because it's really nice to work with, but even with that I can't list all my objects because I run out of heap space (presumably because it is performing GETs on all of them and keeping them in memory).
2) How can I just delete a bucket?
The best thing I've seen is a parallelized delete loop, and that has problems because it sometimes tries to delete the same file. This is what all the 'deleteall' commands that I've run across do.
What do you guys who have boasted about hosting millions of images/txts do? What happens when you want to remove it all?
3) Lastly, are there alternate answers to this? All of these files are txt/xml files so I'm not even sure S3 is such a concern -- maybe I should move this to a document database of sorts??
What it boils down to is that the Amazon S3 API is just straight-up missing 2 very important operations -- COUNT and DEL_BUCKET. (Actually there is a delete bucket command, but it only works when the bucket is empty.) If someone comes up with a method that does not suck for doing these two operations, I'd gladly give up lots of bounty.
UPDATE
Just to answer a few questions: the reason I ask is that for the past year or so I have been storing hundreds of thousands, more like millions, of 2 KB txt and xml documents. The last time I wished to delete the bucket, a couple of months ago, it literally took DAYS to do so, because the bucket has to be empty before you can delete it. This was such a pain in the ass that I fear ever having to do it again without API support for it.
UPDATE
this rocks the house!
http://github.com/SFEley/s3nuke/
I rm'd a good couple gigs worth of 1-2k files within minutes.
I am most certainly not one of those 'guys who have boasted about hosting millions of images/txts', as I only have a few thousand, and this may not be the answer you are looking for, but I looked at this a while back.
From what I remember, there is an API command called HEAD which gets information about an object rather than retrieving the complete object (which is what GET does); that may help in counting the objects.
As far as deleting Buckets, at the time I was looking, the API definitely stated that the bucket had to be empty, so you need to delete all the objects first.
But I never used either of these commands, because I was using S3 as a backup, and in the end I wrote a few routines that uploaded the files I wanted to S3 (so that part was automated) but never bothered with the restore/delete/file management side of the equation. For that I used Bucket Explorer, which did all I needed. In my case, it wasn't worth spending time when for $50 I can get a program that does all I need. There are probably others that do the same (e.g. CloudBerry).
In your case, with Bucket Explorer, you can right-click on a bucket and select delete, or right-click and select properties and it will count the number of objects and the size they take up. It certainly does not download the whole objects. (E.g. the last bucket I looked at was 12 GB and around 500 files, and it would take hours to download 12 GB, whereas the size and count are returned in a second or two.) And if there is a limit, then it certainly isn't 1,000.
Hope this helps.
"List" won't retrieve the data. I use s3cmd (a python script) and I would have done something like this:
s3cmd ls s3://foo | awk '{print $4}' | split -a 5 -l 10000 bucketfiles_
for i in bucketfiles_*; do xargs -n 1 s3cmd rm < $i & done
But first check how many bucketfiles_ files you get. There will be one s3cmd running per file.
It will take a while, but not days.
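If you would rather script it yourself, S3 also offers a Multi-Object Delete call that removes up to 1,000 keys per request, which avoids the one-request-per-file loop entirely. Here is a hedged Go sketch with aws-sdk-go (the bucket name is a placeholder).

package main

import (
    "log"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/s3"
)

// emptyBucket deletes every object in the bucket, 1,000 keys per request.
func emptyBucket(svc *s3.S3, bucket string) error {
    var delErr error
    err := svc.ListObjectsV2Pages(&s3.ListObjectsV2Input{Bucket: aws.String(bucket)},
        func(page *s3.ListObjectsV2Output, lastPage bool) bool {
            if len(page.Contents) == 0 {
                return false
            }
            objects := make([]*s3.ObjectIdentifier, 0, len(page.Contents))
            for _, obj := range page.Contents {
                objects = append(objects, &s3.ObjectIdentifier{Key: obj.Key})
            }
            _, delErr = svc.DeleteObjects(&s3.DeleteObjectsInput{
                Bucket: aws.String(bucket),
                Delete: &s3.Delete{Objects: objects, Quiet: aws.Bool(true)},
            })
            return delErr == nil // stop paging if a batch fails
        })
    if err != nil {
        return err
    }
    return delErr
}

func main() {
    sess := session.Must(session.NewSession())
    svc := s3.New(sess)
    if err := emptyBucket(svc, "my-bucket"); err != nil { // placeholder bucket
        log.Fatal(err)
    }
    // Once empty, the bucket itself can be removed.
    if _, err := svc.DeleteBucket(&s3.DeleteBucketInput{Bucket: aws.String("my-bucket")}); err != nil {
        log.Fatal(err)
    }
}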
1) Regarding your first question, you can list the items in a bucket without actually retrieving them. You can do that with both the SOAP and the REST API. As you can see, you can define the maximum number of items to list and the position to start the listing from (the marker). Read more about it here.
I do not know of any ready-made implementation of the paging, but especially for the REST interface it would be very easy to implement in any language; a bare-bones Go sketch of such paging appears after this answer.
2) I believe the only way to delete a bucket is to first empty it of all items. See also this question.
3) I would say that S3 is very well suited for storing a large number of files. It depends however on what you want to do. Do you plan to also store binary files? Do you need to perform any queries or just listing the files is enough?
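As mentioned in point 1), here is a bare-bones Go sketch of the paging, using aws-sdk-go (the bucket name is a placeholder); the SDK handles the marker/continuation token for you, and only key metadata is transferred, never the object bodies.

package main

import (
    "fmt"
    "log"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/s3"
)

func main() {
    sess := session.Must(session.NewSession())
    svc := s3.New(sess)

    // Each page of the listing returns up to 1,000 keys.
    var count int64
    err := svc.ListObjectsV2Pages(
        &s3.ListObjectsV2Input{Bucket: aws.String("my-bucket")}, // placeholder
        func(page *s3.ListObjectsV2Output, lastPage bool) bool {
            count += int64(len(page.Contents))
            return true // continue until the listing is exhausted
        })
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println("objects in bucket:", count)
}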
I've had the same problem with deleting hundreds of thousands of files from a bucket. It may be worthwhile to fire up an EC2 instance to run the parallel delete because the latency to S3 is low. I think there's some money to be made hosting a bunch of EC2 servers and charging people to delete buckets quickly. (At least until Amazon gets around to changing the API)
Old thread, but still relevant as I was looking for the answer until I just figured this out. I wanted a file count using a GUI-based tool (i.e. no code). I happen to already use a tool called 3Hub for drag & drop transfers to and from S3. I wanted to know how many files I had in a particular bucket (I don't think billing breaks it down by buckets).
So, using 3Hub,
- list the contents of the bucket (looks basically like a finder or explorer window)
- go to the bottom of the list, click 'show all'
- select all (ctrl+a)
- choose copy URLs from right-click menu
- paste the list into a text file (I use TextWrangler for Mac)
- look at the line count
I had 20521 files in the bucket and did the file count in less than a minute.
I'd like to know if anyone's found a better way since this would take some time on hundreds of thousands of files.
To count objects in an S3 bucket:
Go to AWS Billing, then reports, then AWS Usage reports.
Select Amazon Simple Storage Service, then Operation StandardStorage.
Download a CSV file that includes a UsageType of StorageObjectCount that lists the item count for each bucket.
Count
aws s3 ls s3://mybucket/ --recursive | wc -l
From this post
Delete
aws s3 rm --recursive s3://mybucket/ && aws s3 rb s3://mybucket/
This deletes every item and then the bucket itself.