Sync command for OpenStack Object Storage (like S3 Sync)? - amazon-web-services

Using the S3 CLI, I can sync a local directory with an S3 bucket using the following command:
aws s3 sync s3://mybucket/ ./local_dir/
This command performs a complete sync: it copies new files, updates changed files, and (with the --delete option) removes files that no longer exist in the source. I am trying to figure out how to do something equivalent using the OpenStack Object Storage CLI:
http://docs.openstack.org/cli-reference/content/swiftclient_commands.html
The upload command has a --changed option. But I need a complete sync that is also capable of deleting local files that were removed.
Does anyone know if I can do something equivalent to s3 sync?

The link you mentioned documents the objects parameter:
objects – A list of file/directory names (strings) or SwiftUploadObject instances containing a source for the created object, an object name, and an options dict (can be None) to override the options for that individual upload operation
I'm thinking that if you pass the directory along with the --changed option, it should work.
I don't have a Swift deployment to test with. Can you give it a try?
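For what it's worth, here is a rough, untested sketch of that idea in Python: run swift upload with --changed so only new or modified files go up, then delete any remote objects that no longer exist locally. It assumes the swift CLI is installed and already authenticated, and the container and directory names are placeholders.
import os
import subprocess

container = "mycontainer"   # placeholder container name
local_dir = "local_dir"     # placeholder local directory

# Upload new/changed files only; --changed skips files that are unmodified.
subprocess.check_call(["swift", "upload", "--changed", container, local_dir])

# swift names uploaded objects after the path given on the command line,
# so build the matching set of names from the local tree.
local_objects = set()
for root, _dirs, files in os.walk(local_dir):
    for name in files:
        local_objects.add(os.path.join(root, name))

# Delete remote objects that no longer exist locally.
remote_objects = subprocess.check_output(["swift", "list", container], text=True).splitlines()
for obj in remote_objects:
    if obj not in local_objects:
        subprocess.check_call(["swift", "delete", container, obj])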

Related

How can I retrieve a folder from S3 into an AWS SageMaker notebook

I have a folder with several files corresponding to checkpoints of a RL model trained using RLLIB. I want to analyze the checkpoints in a way that requires passing a certain folder as an argument, e.g., analysis_function(folder_path). I have to run this line in a SageMaker notebook. I have seen that there are some questions on SO about how to retrieve files from S3, such as this one. However, how can I retrieve a whole folder?
To read the whole folder, you will just have to list all files in the folder and loop through them. You could either do something like -
import boto3

s3_res = boto3.resource("s3")
my_bucket = s3_res.Bucket("<your-bucket-name>")
for obj in my_bucket.objects.filter(Prefix="<your-prefix>"):
    print(obj.key)  # your code goes here
Or, simply download the files to local storage and loop over them as you see fit (see the aws s3 cp reference):
!aws s3 cp s3://bucket/prefix/ . --recursive
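If you would rather stay entirely in Python, a minimal sketch along these lines could mirror everything under a prefix into the notebook's local filesystem so you can pass that folder to analysis_function(). The bucket name, prefix, and local folder below are placeholders.
import os
import boto3

s3_res = boto3.resource("s3")
my_bucket = s3_res.Bucket("<your-bucket-name>")
local_root = "checkpoints"  # placeholder local target folder

for obj in my_bucket.objects.filter(Prefix="<your-prefix>/"):
    if obj.key.endswith("/"):  # skip folder marker objects
        continue
    local_path = os.path.join(local_root, obj.key)
    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    my_bucket.download_file(obj.key, local_path)
You can then point your analysis at the local copy, e.g. analysis_function(os.path.join(local_root, "<your-prefix>")).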

Need to export the path/url of each file in Amazon S3 server

I have an Amazon S3 server filled with multiple buckets, each bucket containing multiple subfolders. There are easily 50,000 files in total. I need to generate an excel sheet that contains the path/url of each file in each bucket.
For example, if I have a bucket called b1 and it has a file called f1.txt, I want to be able to export the path of f1 as b1/f1.txt.
This needs to be done for every one of the 50,000 files.
I have tried using S3 browsers like Expandrive and Cyberduck; however, they require you to select each and every file to copy its URL.
I also tried exploring the boto3 library in Python, but I did not come across any built-in function to get the file URLs.
I am looking for any tool I can use, or even a script I can execute to get all the urls. Thanks.
Do you have access to the aws cli? aws s3 ls --recursive {bucket} will list all nested files in a bucket.
Eg this bash command will list all buckets, then recursively print all files in each bucket:
aws s3 ls | while read x y bucket; do aws s3 ls --recursive $bucket | while read x y z path; do echo $path; done; done
(the reads just strip off the uninteresting columns).
N.B. I'm using the v1 CLI.
What you should do is have another look at the boto3 documentation, as it is what you are looking for. It is fairly simple to do what you are asking, but it may take a bit of reading if you are new to it. Since there are multiple steps involved, I will try to steer you in the right direction.
In boto3, the S3 method you are looking for is list_objects_v2(). This will give you the 'Key' (object path) of every object. You will notice that it returns the entire JSON blob for each object. Since you are only interested in the Key, you can target it the same way you would access keys/values in a dict. E.g. list_objects_v2()['Contents'][0]['Key'] should return only the object path of the very first object.
Once you've got that working, the next step is to loop and get all the values. You can either use a for loop to do this, or there is an awesome Python package I regularly use called jmespath - https://jmespath.org/
Here is how you can retrieve all object paths (for up to 1,000 objects) in one line:
import boto3
import jmespath

bucket_name = 'im-a-bucket'
s3_client = boto3.client('s3')
bucket_object_paths = jmespath.search('Contents[*].Key', s3_client.list_objects_v2(Bucket=bucket_name))
Now, since your buckets may have more than 1,000 objects, you will need to use a paginator. Have a look at this question to understand it:
How to get more than 1000 objects from S3 by using list_objects_v2?
Basically, list_objects_v2 returns at most 1,000 objects per call. To overcome this, boto3 provides a paginator that treats the 1,000-object limit as pages, so you just iterate over the pages in a for loop to collect all the results you are looking for.
Once you get this working for one bucket, store the result in a list and repeat for the rest of your buckets. Once you have all the data, you can copy-paste it into an Excel sheet or write it out with Python. (I haven't tested the code snippets, but they should work.)
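Putting the paginator and the per-bucket loop together, here is a rough sketch (again untested) that writes one bucket/key path per row to a CSV file, which Excel opens directly; the output filename is just an example:
import csv
import boto3

s3_client = boto3.client("s3")
paginator = s3_client.get_paginator("list_objects_v2")

with open("object_paths.csv", "w", newline="") as f:
    writer = csv.writer(f)
    # One row per object, formatted as <bucket>/<key>, across every bucket
    # the credentials can see.
    for bucket in s3_client.list_buckets()["Buckets"]:
        name = bucket["Name"]
        for page in paginator.paginate(Bucket=name):
            for obj in page.get("Contents", []):
                writer.writerow([f"{name}/{obj['Key']}"])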
Amazon S3 Inventory can help you with this use case.
Do evaluate that option; see https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-inventory.html

How to keep both directory and s3 bucket updated via aws-cli?

I want to synchronize a directory on my machine with a bucket in S3. The problem is that the aws cli sync option does not seem to do what I expected.
The behavior that I'm looking for is such that when I run the command it evaluates the content of the local directory and the content of the s3 bucket and updates the one with the old content with the changes in the other.
I can't figure out the best approach. Thanks in advance.
I think it's a one-way sync. If anything in the source is new or updated, it will be synced to the destination. Even if you have a new file in the destination, it won't be synced back down to your source; you would need to run the sync again from destination back to source. It's a one-way sync, from source to destination.
You can't do a true two-way sync with the awscli; aws s3 sync s3://bucket . only synchronizes in one direction.
If you want a tool that behaves the way you need, you will have to write your own Python script using boto3: https://github.com/boto/boto3
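As a very rough starting point for such a script, the sketch below compares the two sides and lets the newer copy win in each direction. The bucket and directory names are placeholders, nothing is deleted, and keep in mind that S3's LastModified is the upload time, so the comparison with local mtimes is only an approximation.
import os
from datetime import datetime, timezone

import boto3

bucket_name = "mybucket"   # placeholder bucket name
local_dir = "local_dir"    # placeholder local directory

s3 = boto3.client("s3")

# Collect remote keys with their last-modified timestamps.
remote = {}
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket_name):
    for obj in page.get("Contents", []):
        remote[obj["Key"]] = obj["LastModified"]

# Collect local files with their mtimes as timezone-aware datetimes.
local = {}
for root, _dirs, files in os.walk(local_dir):
    for name in files:
        path = os.path.join(root, name)
        key = os.path.relpath(path, local_dir).replace(os.sep, "/")
        local[key] = datetime.fromtimestamp(os.path.getmtime(path), tz=timezone.utc)

# Upload files that are new or newer locally.
for key, mtime in local.items():
    if key not in remote or mtime > remote[key]:
        s3.upload_file(os.path.join(local_dir, key), bucket_name, key)

# Download files that are new or newer in the bucket.
for key, last_modified in remote.items():
    if key not in local or last_modified > local[key]:
        dest = os.path.join(local_dir, key)
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        s3.download_file(bucket_name, key, dest)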
The npm package (and CLI command) s3-cli can sync a local folder and an S3 bucket in both directions.
s3-cli sync [--delete-removed] /path/to/folder/ s3://bucket/key/on/s3/
s3-cli sync [--delete-removed] s3://bucket/key/on/s3/ /path/to/folder/

Rename and Move S3 files based on their folders name in pyspark

I am writing some dataframes to S3 using partitionBy. The folder structure that gets created is as below.
root/
date=2018-01-01/
date=2018-01-02/
I want to move these files to another directory in s3 and rename the folders as
root1/
20180101/
20180102/
Is there a way that I can achieve this from pyspark?
Also, I need the files inside the directories to be renamed sequentially, e.g.:
root1/
20180101/FILE_1.csv
20180101/FILE_2.csv
You can't directly rename S3 objects.
So one way to achieve this is to copy the objects under the desired names and then delete the original objects.
Also, S3 buckets do not have a real directory structure; the "directory structure" is just prefixes in the objects' keys.
You have two options: either call the AWS CLI from Python using subprocess, or use the boto3 library to copy all the files from one "directory" to another.
Solution using subprocess:
import subprocess
subprocess.check_call("aws s3 sync s3://bucket/root/date=2018-01-01/ s3://bucket/root1/20180101/".split())
The sync command copies recursively. Then you can remove the original prefix with aws s3 rm --recursive "somepath", calling it through subprocess again, as in the sketch below.
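To cover every date partition rather than a single one, a sketch like the following could discover the partition prefixes with boto3 and then issue the same sync/rm calls through subprocess; the bucket name and prefixes are placeholders that assume the layout from the question.
import subprocess

import boto3

bucket = "bucket"  # placeholder bucket name
s3 = boto3.client("s3")

# Find the partition "folders" under root/ using the delimiter trick.
resp = s3.list_objects_v2(Bucket=bucket, Prefix="root/", Delimiter="/")
for cp in resp.get("CommonPrefixes", []):
    prefix = cp["Prefix"]                         # e.g. "root/date=2018-01-01/"
    date_part = prefix.split("date=")[1].rstrip("/")
    new_prefix = "root1/" + date_part.replace("-", "") + "/"
    src = f"s3://{bucket}/{prefix}"
    dst = f"s3://{bucket}/{new_prefix}"
    subprocess.check_call(["aws", "s3", "sync", src, dst])
    subprocess.check_call(["aws", "s3", "rm", "--recursive", src])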

How to copy only files from many subdirectory under the directory to another project bucket in GCP?

I have a huge amount of data in my Google Cloud Storage bucket and I have to copy all of the files to a bucket in another project. The main problem is that the bucket contains folders with many sub-folders, all of which contain data, so when I use the normal gsutil copy command it copies all the data along with the folder structure.
I need help resolving this, because it is taking too much time to copy from one project's bucket to the other.
You can use this command to place all the files in the root path of the destination bucket:
gsutil cp 'gs://[YOUR_FIRST_BUCKET_NAME]/*' gs://[YOUR_SECOND_BUCKET_NAME]
If you have nested directories inside your bucket, use this command:
gsutil cp -r 'gs://[YOUR_FIRST_BUCKET_NAME]/*' gs://[YOUR_SECOND_BUCKET_NAME]
Pay attention to single quotes around the first command.
You can take a look at the Wildcard Names if you need more advanced features.
You can use Google Data Transfer Service
It is the second option in the Google Cloud Storage subcategory.
Use the gsutil cp command without the -r option. From the gsutil documentation:
The -R and -r options are synonymous. Causes directories, buckets, and bucket subdirectories to be copied recursively. If you neglect to use this option for an upload, gsutil will copy any files it finds and skip any directories. Similarly, neglecting to specify this option for a download will cause gsutil to copy any objects at the current bucket directory level, and skip any subdirectories.
If I understand correctly, you want to copy all the files from one bucket to another bucket, but you don't want to keep the same hierarchy; instead, you want all the files in the root path.
Currently there is no way to do that with gsutil alone, but you can do it with a script. Here is my solution:
from google.cloud import storage

bucketOrigin = storage.Client().get_bucket("<BUCKET_ID_ORIGIN>")
bucketDestination = storage.Client().get_bucket("<BUCKET_ID_DESTINATION>")

for blob in bucketOrigin.list_blobs():
    # Download each object and re-upload it under its base name only.
    strfile = blob.download_as_string()
    blobDest = bucketDestination.blob(blob.name[blob.name.rfind("/") + 1:])
    blobDest.upload_from_string(strfile)
As mentioned by Akash Dathan, you can use the Cloud Storage Transfer Service to move your bucket content. I recommend you take a look at this Moving and Renaming Buckets guide, where you can find the steps required to perform this task.
Bear in mind the following requirements:
The Transfer Service's service account must have permission to read from your source and write to your destination.
If you're deleting the source files, the Transfer Service's service account will need delete access to the source.
If your service account doesn't have these permissions yet, a bucket owner must grant them.
Note: if you have the 'storage.buckets.setIamPolicy' permission for the source and destination buckets, creating a transfer job will grant that service account the required source and destination permissions to complete the transfer.
You can list all the files in your subfolders and get each file name using the split() method. Then you can use the copy() method to copy the file to another bucket. The snippet below drops all the subfolder prefixes:
const { Storage } = require("@google-cloud/storage");
const storage = new Storage();

async function copyFlattened(srcBucketName, destBucketName, prefix) {
  const [files] = await storage.bucket(srcBucketName).getFiles();
  files.forEach((file) => {
    // Keep only the base name, dropping the subfolder prefixes.
    const fileName = file.name.split("/").pop();
    if (fileName)
      file.copy(storage.bucket(destBucketName).file(`${prefix}/${fileName}`));
  });
}
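For comparison, here is a similar sketch in Python with google-cloud-storage, using copy_blob so the copy happens server side without downloading the data; the bucket names and destination prefix are placeholders.
from google.cloud import storage

client = storage.Client()
src_bucket = client.bucket("src-bucket-name")    # placeholder source bucket
dest_bucket = client.bucket("dest-bucket-name")  # placeholder destination bucket
prefix = "flat"                                  # placeholder destination prefix

for blob in src_bucket.list_blobs():
    # Keep only the base name, dropping the subfolder prefixes.
    file_name = blob.name.rsplit("/", 1)[-1]
    if file_name:
        src_bucket.copy_blob(blob, dest_bucket, f"{prefix}/{file_name}")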