I am practicing AWS CLI commands. My client has given me an AWS IAM access key and secret, but no console login for the account; the keys are the ones used by the project itself. What I am trying to do is list all the files recursively within an S3 bucket.
This is what I have done so far.
I have configured the AWS profile for CLI using the following command
aws configure
Then I could list all the available buckets by running the following command
aws s3 ls
Then I am trying to list all the files within a bucket. I tried running the following command.
aws s3 ls s3://my-bucket-name
But it does not seem to give me the full contents. Also, I need a way to navigate around the bucket. How can I do that?
You want to list all of the objects recursively but aren't using the --recursive flag. Without it, aws s3 ls only shows prefixes and any objects at the root level.
Relevant docs https://docs.aws.amazon.com/cli/latest/reference/s3/ls.html
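For example, using the bucket name from the question:
aws s3 ls s3://my-bucket-name --recursive
Adding --human-readable --summarize to that command prints friendlier sizes and a total object count and size at the end.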
A few options.
Roll your own
If you run an aws s3 ls and a line item shows the word "PRE" instead of a modified date and size, that means it's a "directory" you can traverse. You can write a quick bash script that re-runs aws s3 ls on everything that returns "PRE", since those prefixes are hiding more files.
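For instance, a rough, untested sketch of such a script (bucket name taken from the question; it assumes the default ls output of either "PRE <prefix>/" or "<date> <time> <size> <key>" and doesn't handle prefixes containing spaces):
# walk every "PRE" prefix and print the full key of every object found
list_all() {
    aws s3 ls "s3://my-bucket-name/$1" | while read -r f1 f2 f3 f4; do
        if [ "$f1" = "PRE" ]; then
            list_all "$1$f2"        # a prefix ("directory"): recurse into it
        else
            echo "$1$f4"            # an object line: date, time, size, key
        fi
    done
}
list_all ""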
s3fs
Using the s3fs-fuse project on GitHub, you can mount an S3 bucket on your file system and explore it that way. I haven't tested this and thus can't personally recommend it, but it seems viable, and it might be a simple way to use tools you already have and understand (like tree).
One concern I have: when I've used similar software, it made a lot of API calls, and if left mounted long-term it can run up costs just from the number of API calls.
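For reference, a typical mount might look like the following (untested here; the mount point is arbitrary, and it assumes s3fs is installed and can read your credentials, e.g. via a profile in ~/.aws/credentials):
mkdir -p ~/s3-mount
s3fs my-bucket-name ~/s3-mount -o profile=default
tree ~/s3-mount          # or ls -R, find, etc.
fusermount -u ~/s3-mount # unmount when done to stop the background API calls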
Sync everything to your local machine (not recommended)
Adding this for completeness, but you can run:
aws s3 sync s3://mybucket/ ./
This will try to copy everything to your computer and you'll be able to browse it with your own filesystem tools. However, S3 buckets can hold petabytes of data, so you may not be able to sync it all to your system. Also, S3 provides strong security protections for the data, which your personal computer probably doesn't.
Related
I have a log archive bucket, and that bucket has 2.5m+ objects.
I am looking to download files from a specific time period. I have tried different methods for this, but all of them are failing.
My observation is that these queries start from the oldest files, but the files I seek are the newest, so it takes forever to find them.
aws s3 sync s3://mybucket . --exclude "*" --include "2021.12.2*" --include "2021.12.3*" --include "2022.01.01*"
Am I doing something wrong?
Is it possible to make these queries start from the newest files so they take less time to complete?
I also tried using S3 Browser and CloudBerry. Same problem. Tried with an EC2 instance inside the same AWS network. Same problem.
2.5m+ objects in an Amazon S3 bucket is indeed a large number of objects!
When listing the contents of an Amazon S3 bucket, the S3 API only returns 1000 objects per API call. Therefore, when the AWS CLI (or CloudBerry, etc) is listing the objects in the S3 bucket it requires 2500+ API calls. This is most probably the reason why the request is taking so long (and possibly failing due to lack of memory to store the results).
You can possibly reduce the time by specifying a Prefix, which reduces the number of objects returned from the API calls. This would help if the objects you want to copy are all in a sub-folder.
Failing that, you could use Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects. You could then extract from that CSV file a list of objects you want to copy (eg use Excel or write a program to parse the file). Then, specifically copy those objects using aws s3 cp or from a programming language. For example, a Python program could parse the CSV file and then use download_file() to download each of the desired objects.
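A rough shell variation of that idea (untested; it assumes the inventory is in CSV format with the key as the second, double-quoted column, that your keys begin with the date as in the includes above, and that the keys contain no characters that get URL-encoded in the report):
# pull the matching keys out of the inventory report and copy each one
grep -E '^"[^"]*","(2021\.12\.2|2021\.12\.3|2022\.01\.01)' inventory.csv \
    | cut -d',' -f2 | tr -d '"' \
    | while read -r key; do aws s3 cp "s3://mybucket/${key}" ./; done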
The simple fact is that a flat-structure Amazon S3 bucket with 2.5m+ objects will always be difficult to list. If possible, I would encourage you to use 'folders' to structure the bucket so that you would only need to list portions of the bucket at a time.
Currently, my S3 bucket contains files. I want to create a folder for each file on S3.
Current -> s3://<bucket>/test.txt
Expectation -> s3://<bucket>/test/test.txt
How can I achieve this using the EC2 instance?
S3 doesn't really have "folders"; object names may contain / characters, which in a way emulates folders. Simply name your objects test/<filename> to achieve that. See the S3 docs for more.
As for doing it from EC2, it is no different from doing it from anywhere else (except that on EC2 you may be able to rely on an IAM instance profile instead of ad-hoc credentials). If you've tried it and failed, maybe post a new question with more details.
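For a single object, for example, the 'move into a folder' is just a rename:
aws s3 mv s3://<bucket>/test.txt s3://<bucket>/test/test.txt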
If you have Linux you can try something like:
aws s3 ls s3://bucket/ | while read -r date time size name; do aws s3 mv "s3://bucket/${name}" "s3://bucket/${name%.*}/${name}"; done
This does not depend on an EC2 instance. You can use the AWS CLI from an EC2 instance or from anywhere else, supplying the desired destination path, in your case s3://<bucket>/test/test.txt. You can even change the name of the file you are copying into the S3 bucket, or even its extension, if you want.
I am trying to download millions of records from an S3 bucket to a NAS. Because there is no particular pattern to the filenames, I can rely only on the modified date to run multiple CLI commands in parallel for a quicker download. I am unable to find any help on downloading files based on modified date. Any input would be highly appreciated!
Someone mentioned using s3api, but I am not sure how to use s3api with the cp or sync command to download files.
current command:
aws --endpoint-url http://example.com s3 cp s3://objects/EOB/ \\images\OOSS\EOB --exclude "*" --include "Jun" --recursive
I think this is wrong because --include here refers to 'Jun' appearing within the file name, not the modified date.
The AWS CLI will copy files in parallel.
Simply use aws s3 sync and it will do all the work for you. (I'm not sure why you are providing an --endpoint-url)
Worst case, if something goes wrong, just run the aws s3 sync command again.
It might take a while for the sync command to gather the list of objects, but just let it run.
If you find that there is a lot of network overhead due to so many small files, then you might consider:
Launch an Amazon EC2 instance in the same region (make it fairly big to get large network bandwidth; cost isn't a factor since it won't run for more than a few days)
Do an aws s3 sync to copy the files to the instance
Zip the files (probably better in several groups rather than one large zip)
Download the zip files via scp, or copy them back to S3 and download from there
This way, you are minimizing the chatter and bandwidth going in/out of AWS.
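A rough sketch of steps 2-4 on that instance (untested; bucket, prefix, and local paths are placeholders, and it zips one archive per sub-folder rather than one big archive):
aws s3 sync s3://mybucket/EOB/ ./eob/
# zip each sub-folder into its own archive
for d in ./eob/*/; do zip -rq "${d%/}.zip" "$d"; done
# push only the archives back to S3, then download them from there
aws s3 cp ./eob/ s3://mybucket/zips/ --recursive --exclude "*" --include "*.zip"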
I'm assuming you're looking to sync arbitrary date ranges, and not simply maintain a local synced copy of the entire bucket (which you could do with aws s3 sync).
You may have to drive this from an Amazon S3 Inventory. Use the inventory list, and specifically the last modified timestamps on objects, to build a list of objects that you need to process. Then partition those somehow and ship sub-lists off to some distributed/parallel process to get the objects.
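For the parallel step, a minimal sketch (assuming you have already extracted the wanted keys from the inventory report into keys.txt, one key per line; the bucket name and NAS path are placeholders):
# run 8 copies of the CLI at once, one object per invocation
xargs -P8 -I{} aws s3 cp "s3://my-log-bucket/{}" /nas/downloads/ < keys.txt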
We have the following workflow at my work:
Download the data from AWS s3 bucket to the workspace:
aws s3 cp --only-show-errors s3://bucket1
Unzip the data
unzip -q "/workspace/folder1/data.zip" -d "/workspace/folder2"
Run a java command
java -Xmx1024m -jar param1 etc...
Sync the archive back to the s3 target bucket
aws s3 sync --include #{archive.location} s3://bucket
As you can see, downloading the data from the S3 bucket, unzipping it, running the Java operation, and copying the results back to S3 takes a lot of time and resources.
Hence, we are planning to unzip directly in the S3 target bucket and run the Java operation there. Would it be possible to run the Java operation directly in the S3 bucket? If yes, could you please provide some insights?
It's not possible to run the Java 'in S3', but what you can do is move your Java code to an AWS Lambda function, so all the work is done 'in the cloud', i.e., there is no need to download to a local machine, process, and re-upload.
Without knowing the details of your requirements, I would consider setting up an S3 event notification that fires each time a new file is PUT into a particular location, an AWS Lambda function that gets invoked with the details of that new file, and having Lambda write the results to a different bucket/location.
I have done similar things (though not with Java) and have found it a rock-solid way of processing files.
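A rough CLI sketch of that wiring (untested; the function name, account ID, region and bucket are placeholders, and the Lambda itself must already exist):
# allow S3 to invoke the (hypothetical) processing function
aws lambda add-permission --function-name process-upload \
    --statement-id s3invoke --action lambda:InvokeFunction \
    --principal s3.amazonaws.com --source-arn arn:aws:s3:::bucket1
# send ObjectCreated events for new uploads to that function
aws s3api put-bucket-notification-configuration --bucket bucket1 \
    --notification-configuration '{"LambdaFunctionConfigurations": [{"LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:process-upload", "Events": ["s3:ObjectCreated:Put"]}]}'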
No.
You cannot run code on S3.
S3 is an object store; it doesn't provide any execution environment. To modify a file, you need to download it, modify it, and upload it back to S3.
If you need to run operations on files, you can look into Amazon Elastic File System (EFS), which you can mount on your EC2 instance and run the operations there.
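For example, on an EC2 instance with the amazon-efs-utils mount helper installed (the filesystem ID and mount point below are placeholders):
sudo mkdir -p /mnt/efs
sudo mount -t efs -o tls fs-12345678:/ /mnt/efs
# files under /mnt/efs can now be unzipped and processed in place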
We have options to:
1. Copy file/object to another S3 location or local path (cp)
2. List S3 objects (ls)
3. Create bucket (mb) and move objects to bucket (mv)
4. Remove a bucket (rb) and remove an object (rm)
5. Sync objects and S3 prefixes
and many more.
But before using these commands, we need to check whether the S3 service is available in the first place. How do we do that?
Is there a command like:
aws S3 -isavailable
and we get a response like:
0 - S3 is available; I can go ahead and upload objects/create buckets etc.
1 - S3 is not available; you can't upload objects etc.?
You should assume that Amazon S3 is available. If there is a problem with S3, you will receive an error when making a call with the AWS CLI.
If you are particularly concerned, then run a simple CLI command first, eg aws s3 ls, and throw away the results. But that's really the same concept. Or, you could use the --dryrun option available on several s3 commands (cp, mv, rm, sync), which displays the operations that would be performed without actually running them.
It is more likely that you will have an error in your configuration (eg wrong region, credentials not valid) than S3 being down.
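If you still want to script the check suggested above, a minimal sketch that just relies on the CLI's exit status (0 means the call succeeded):
if aws s3 ls > /dev/null 2>&1; then
    echo "S3 call succeeded - safe to proceed"
else
    echo "S3 call failed - check credentials, region, or the AWS status page" >&2
fi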