How to copy only the files from many subdirectories to a bucket in another project in GCP? - google-cloud-platform

I have a huge amount of data in my Google Cloud Storage bucket, and I have to copy all the files to a bucket in another project. The main problem is that in this bucket I created a folder, and under this folder there are many sub-folders, all of which contain data. So when I use the normal gsutil copy command, it copies all the data along with the folders.
I need help resolving this problem, because copying from one project's bucket to another is taking too much time.

You can use this command to have all the files in the root path.
gsutil cp 'gs://[YOUR_FIRST_BUCKET_NAME]/*' gs://[YOUR_SECOND_BUCKET_NAME]
If you have nested directories inside your bucket, use this command:
gsutil cp -r 'gs://[YOUR_FIRST_BUCKET_NAME]/*' gs://[YOUR_SECOND_BUCKET_NAME]
Note the single quotes around the source URL in both commands: they stop the shell from expanding the * itself.
You can take a look at the Wildcard Names if you need more advanced features.

You can use Google Data Transfer Service
It is the second option in the Google Cloud Storage subcategory.

Use the gsutil cp command without the -r option.
The -R and -r options are synonymous. Causes directories,
buckets, and bucket subdirectories to be copied recursively.
If you neglect to use this option for an upload, gsutil will
copy any files it finds and skip any directories. Similarly,
neglecting to specify this option for a download will cause
gsutil to copy any objects at the current bucket directory
level, and skip any subdirectories.
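To make the "current bucket directory level" behaviour concrete, here is a minimal Python sketch. It treats object names as plain strings and separates the objects that sit directly at a given level from the "subdirectories" that would be skipped without -r. The function name and the sample object names are illustrative assumptions, not part of gsutil.

```python
def split_level(object_names, prefix=""):
    """Separate names directly under `prefix` from names nested deeper."""
    direct, nested_dirs = [], set()
    for name in object_names:
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        if not rest:
            continue  # placeholder object for the "folder" itself
        if "/" in rest:
            # Anything with a further "/" belongs to a deeper "subdirectory".
            nested_dirs.add(prefix + rest.split("/", 1)[0] + "/")
        else:
            direct.append(name)
    return direct, sorted(nested_dirs)


# Hypothetical object names, for illustration only.
names = ["a.txt", "data/b.txt", "data/2020/c.txt", "logs/d.txt"]
direct, skipped = split_level(names)
print(direct)   # objects at the top level
print(skipped)  # "subdirectories" that are skipped without -r
```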

If I understand correctly, you want to copy all the files from one bucket to another without keeping the same hierarchy, so that all the files end up in the root path.
Currently there is no way to do that with gsutil, but you can do it with a script. Here is my solution:
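from google.cloud import storage

# Reuse a single client for both buckets.
client = storage.Client()
bucketOrigin = client.get_bucket("<BUCKET_ID_ORIGIN>")
bucketDestination = client.get_bucket("<BUCKET_ID_DESTINATION>")

for blob in bucketOrigin.list_blobs():
    # Download the object contents into memory.
    strfile = blob.download_as_string()
    # Keep only the part of the name after the last "/" so the file lands in the root.
    blobDest = bucketDestination.blob(blob.name[blob.name.rfind("/") + 1:])
    blobDest.upload_from_string(strfile)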
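The script above downloads every object and uploads it again, which can be slow for large buckets. A possible alternative is a server-side copy with the client library's Bucket.copy_blob. The following is only a sketch: the helper names flat_name and copy_flat are hypothetical, the two buckets are passed in as arguments (for example storage.Client().bucket("<BUCKET_ID_ORIGIN>")), and it assumes the objects are small enough for a single copy request.

```python
import posixpath


def flat_name(object_name):
    """Return the part of an object name after the last '/'."""
    return posixpath.basename(object_name)


def copy_flat(src_bucket, dst_bucket):
    """Copy every object in src_bucket to the root of dst_bucket.

    Both arguments are expected to be google.cloud.storage Bucket objects.
    """
    copied = []
    for blob in src_bucket.list_blobs():
        new_name = flat_name(blob.name)
        if not new_name:
            continue  # skip "folder" placeholder objects whose name ends in '/'
        # copy_blob asks Cloud Storage to copy the object itself,
        # so the data does not pass through this machine.
        src_bucket.copy_blob(blob, dst_bucket, new_name)
        copied.append(new_name)
    return copied
```

The service account running this needs read access to the source bucket and write access to the destination bucket.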

As mentioned by Akash Dathan, you can use the Cloud Storage Transfer Service to move your bucket content. I recommend taking a look at the Moving and Renaming Buckets guide, where you can find the steps required to perform this task.
Bear in mind the following requirements:
The Transfer Service's service account must have permission to read from your source and write to your destination.
If you're deleting the source files, the Transfer Service's service account will need delete access to the source.
If your service account doesn't have these permissions yet, a bucket owner must grant them.
Note: If you have the 'storage.buckets.setIamPolicy' permission for the source and destination buckets, creating a transfer job will grant that service account the required source and destination permissions to complete the transfer.

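You can list all files from your subfolders and get each file name by using the split() method. Then you can use the copy() method to copy the file to another bucket. The code below removes all subfolders (it assumes storage, srcBucketName, destBucketName and prefix are already defined, and that it runs inside an async function):
const [files] = await storage.bucket(srcBucketName).getFiles();
for (const file of files) {
  // Keep only the part of the object name after the last "/".
  const fileName = file.name.split("/").pop();
  // Folder placeholder objects end in "/" and yield an empty name; skip them.
  if (fileName) {
    // Await each copy so that failures are not silently dropped.
    await file.copy(storage.bucket(destBucketName).file(`${prefix}/${fileName}`));
  }
}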
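One thing to check before flattening: two objects in different subfolders can share the same file name, and after the subfolders are removed the later copy overwrites the earlier one. A minimal Python sketch of a pre-flight check follows; the function name and the sample object names are assumptions for illustration.

```python
from collections import defaultdict


def find_collisions(object_names):
    """Map each flattened file name to the source names that share it."""
    by_base = defaultdict(list)
    for name in object_names:
        base = name.rsplit("/", 1)[-1]
        if base:  # ignore "folder" placeholders ending in '/'
            by_base[base].append(name)
    # Keep only the file names that more than one object would map to.
    return {base: names for base, names in by_base.items() if len(names) > 1}


# Hypothetical listing, for illustration only.
names = ["2020/report.csv", "2021/report.csv", "2021/summary.csv"]
print(find_collisions(names))
```

If the result is not empty, the flattened names need a disambiguating suffix or prefix before copying.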

Related

Move large number of folders and files inside a GCS bucket

I have a bucket on GCP and at the top level of this bucket, I have a bunch of folders.
I want to create a new folder and move all of the other ones into it.
However, I've mounted my bucket with gcsfuse and tried traditional Linux mv commands. This is not allowed, apparently.
Likewise, I have also tried gsutil -m mv gs://mybucket/* gs://mybucket/new_folder/ and have received the command error that wildcards are not allowed in this operation.
What's the best option to get this large number of files moved into a new directory?
Posting this as a Community Wiki answer, based on the comments provided by @JohnHanley.
A few concepts to note for Cloud Storage.
Objects are immutable, which means you cannot rename them. You must copy objects and delete the originals to emulate changing the name.
Directories/Folders do not exist. The namespace is flat, all objects are in the root directory. The appearance of folders is just a part of the object name.
Cloud Storage supports internal object copy. Be careful not to use a feature which first downloads the object and then uploads it.
Considering this information, you will need to use a tool such as gsutil to rename and move the files as you would like.
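As an illustration of the copy-then-delete idea, here is a hedged Python sketch that moves every object under a new "folder" by renaming it. It assumes the google-cloud-storage client library, whose Bucket.rename_blob performs a copy followed by a delete; the function name move_under and the dry_run flag are hypothetical and not part of any library.

```python
def move_under(bucket, new_folder, dry_run=True):
    """Rename every object so that it lives under `new_folder/`.

    `bucket` is expected to be a google.cloud.storage Bucket.
    Returns the list of (old_name, new_name) pairs.
    """
    target = new_folder.rstrip("/") + "/"
    plan = []
    # Materialise the listing first so renamed objects are not listed again.
    for blob in list(bucket.list_blobs()):
        if blob.name.startswith(target):
            continue  # already in the new folder
        new_name = target + blob.name
        plan.append((blob.name, new_name))
        if not dry_run:
            # rename_blob copies the object to the new name, then deletes the original.
            bucket.rename_blob(blob, new_name)
    return plan
```

Running it once with dry_run=True and inspecting the returned pairs is a cheap way to confirm the new names before anything is changed.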

Copying objects from one bucket directory folder to another bucket folder using transfer

I want to use Google's Transfer service to copy all folders/files in a specific directory in Bucket-1 to the root directory of Bucket-2.
I have tried to use Transfer with the filter option, but it doesn't copy anything across.
Any pointers on getting this to work within Transfer, or a step-by-step guide for Cloud Functions, would be really appreciated.
I reproduced your scenario and it worked for me using gsutil.
For example:
gsutil cp -r gs://SourceBucketName/example.txt gs://DestinationBucketName
Furthermore, I tried copying with the Transfer option and it also worked. These are the steps I followed:
1 - Create new Transfer Job
Panel: “Select Source”:
2 - Select your source for example Google Cloud Storage bucket
3 - Select your bucket with the data which you want to copy.
4 - On the field “Transfer files with these prefixes” add your data (I used “example.txt”)
Panel “Select destination”:
5 - Select your destination Bucket
Panel “Configure transfer”:
6 - Run now if you want to complete the transfer now.
7 - Press “Create”.
For more information about copying from one bucket to another, you can check the official documentation.
So, a few things to consider here:
You have to keep in mind that Google Cloud Storage buckets don’t treat subdirectories the way you would expect. To the bucket it is basically all part of the file name. You can find more information about that in the How Subdirectories Work documentation.
The previous is also the reason why you cannot transfer a file that is inside a “directory” and expect to see only the file’s name appear in the root of your targeted bucket. To give you an example:
If you have a file at gs://my-bucket/my-bucket-subdirectory/myfile.txt, once you transfer it to your second bucket it will still have the subdirectory in its name, so the result will be: gs://my-second-bucket/my-bucket-subdirectory/myfile.txt
This is why, if you are interested in automating this process, you should definitely give the Google Cloud Storage Client Libraries a try.
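A minimal sketch of what that could look like with the Python client library is shown below. It copies everything under one "directory" prefix to the root of another bucket by dropping the prefix from each object name. The function names are hypothetical, the buckets are assumed to be google.cloud.storage Bucket objects, and any structure below the prefix is kept as-is.

```python
def strip_prefix(object_name, prefix):
    """Drop a leading "directory" prefix from an object name."""
    prefix = prefix.rstrip("/") + "/"
    if not object_name.startswith(prefix):
        raise ValueError(f"{object_name!r} is not under {prefix!r}")
    return object_name[len(prefix):]


def copy_prefix_to_root(src_bucket, dst_bucket, prefix):
    """Copy objects under `prefix` in src_bucket to the root of dst_bucket."""
    prefix = prefix.rstrip("/") + "/"
    copied = []
    for blob in src_bucket.list_blobs(prefix=prefix):
        new_name = strip_prefix(blob.name, prefix)
        if new_name:  # skip the placeholder object for the prefix itself
            # Server-side copy under the shortened name.
            src_bucket.copy_blob(blob, dst_bucket, new_name)
            copied.append(new_name)
    return copied
```

With the example above, my-bucket-subdirectory/myfile.txt would arrive in the second bucket as myfile.txt.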
Additionally, you could also use the GCS Client with Google Cloud Functions. However, I would just suggest this if you really need the Event Triggers offered by GCF. If you just want the transfer to run regularly, for example on a cron job, you could still use the GCS Client somewhere other than a Cloud Function.
The Cloud Storage Tutorial might give you a good example of how to handle Storage events.
Also, in your future posts, try to provide as much relevant information as possible. For this post, for example, it would have been nice to know what file structure you have in your buckets and what output you have been getting. If you state your use case up front, it will also prevent other users from suggesting solutions that don't apply to your needs.
Try this in Cloud Shell in the project:
gsutil cp -r gs://bucket1/foldername gs://bucket2

Deleting a large folder from Google Cloud Storage

I have a folder in a Google Cloud Storage bucket, which has millions of files that I need to remove.
What is an efficient way to delete this large folder of files, without having to delete the entire bucket?
I've tried using the gsutil rm command, but it seems like it will take a long time to finish deleting all files.
Furthermore, I also read about Object Lifecycle Management policies, but I read that they apply to the entire bucket, as opposed to any specific folder.
Thanks for your help! :)
gsutil rm will be fastest. Your only alternative is to write code to list and delete each one, which is what gsutil is going to do for you.
Try the Storage Transfer Service. In my experience it can delete 1-1.5k objects per second.
Delete files by running a transfer from a source bucket (or folder in a bucket) that is empty to the destination bucket or folder that you want to empty, and enable the option that deletes files at the destination that aren't at the source.
If you are using the GUI, select that option in the advanced transfer options dialog.
You can also create and run the job from the CLI. This example assumes you have access to gs://bucket1/empty/ (which has no objects in it) and you want to delete all objects from gs://bucket2/folder1/:
gcloud transfer jobs create \
  gs://bucket1/empty/ gs://bucket2/folder1/ \
  --delete-from=destination-if-unique \
  --project my-project
If you want your deletes to happen even faster, you'll need to create multiple transfer jobs and have them target different sections of the bucket. Because each job has to do a bucket listing to find the files to delete, you'd want to make the destination paths non-overlapping (e.g. gs://bucket2/folder1/ and gs://bucket2/folder2/, etc.). The jobs will run in parallel, getting the work done in less total time.
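The multi-job idea can be sketched in Python by generating one command per destination path and refusing overlapping paths. This only builds the argument lists, reusing the flags from the command above; the function name is hypothetical and the overlap check assumes every path ends in a trailing slash.

```python
import shlex


def build_delete_jobs(empty_source, destinations, project):
    """Return one `gcloud transfer jobs create` argv list per destination."""
    # Overlapping destinations would make two jobs list the same objects.
    for a in destinations:
        for b in destinations:
            if a != b and b.startswith(a):
                raise ValueError(f"{b} overlaps {a}")
    return [
        ["gcloud", "transfer", "jobs", "create", empty_source, dest,
         "--delete-from=destination-if-unique", "--project", project]
        for dest in destinations
    ]


jobs = build_delete_jobs(
    "gs://bucket1/empty/",
    ["gs://bucket2/folder1/", "gs://bucket2/folder2/"],
    "my-project",
)
for argv in jobs:
    # Printed for review; subprocess.run(argv, check=True) would execute it.
    print(shlex.join(argv))
```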

Copy all objects to another S3 bucket in different region with different structure

I have an S3 bucket in Region A structured like this:
ProviderA-1-1
31423423.jpg
ProviderB-1-1
32423432.jpg
The top level folder is a unique image identifier. The filename is the version of the image.
I want to copy the images to a bucket in Region B, structured like this:
ProviderA-1-1.jpg
ProviderB-1-1.jpg
I.e., I don't care about the version. I just want the folder name (which is unique) to be the filename.
The reason I'm doing this is to have a flat structure so I can make use of image services like Imgix / ImageKit (they provide on-the-fly image transformation, given a flat source origin).
So, my requirements are:
1. I need to copy lots of images (millions of them, ~10TB)
2. The destination bucket is in another region
3. I need to 'flatten' the structure, and change the name of each image to the name of the folder it is in (folder names aren't fixed)
I've seen a few answers here suggesting the AWS CLI is the best approach, but I'm not sure how I can achieve 3. with that.
It sounds like I need to loop through the images one by one, changing the name before I copy. If a script is suggested, I'm most comfortable with .NET, so perhaps the AWS .NET SDK?
This is a one-off job, where I need to move the images as quickly and cheaply as possible.
Advice please?
Thanks :)
Yes, a script is required because you are moving and renaming the files.
If you're comfortable with .NET, then use that!
The basic program would be:
Create two S3 clients: one for the source bucket (to obtain the listing) and one for the destination bucket. Two are needed because the buckets are in different regions, and copy commands are sent to the destination bucket, which pulls the file from the source bucket.
Use ListObjects() to obtain a listing of the source bucket. Note that it returns at most 1000 files at a time, so use NextMarker to request the subsequent batch.
Loop through each file and use CopyObject() to simultaneously copy and rename the file. Use your own logic to take the folder name and convert it to a filename. Each file will be copied directly between the buckets, without needing to download/upload.
Continue looping through each list of 1000 files, then get the next 1000 files, and so on.
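The steps above can be sketched as follows. This is written in Python with boto3 rather than .NET, purely as an illustration of the flow: the function names are hypothetical, the two clients are assumed to be created by the caller (for example boto3.client("s3", region_name=...) for each region), and the list_objects_v2 paginator stands in for the manual NextMarker loop.

```python
import posixpath


def flattened_key(key):
    """Turn 'ProviderA-1-1/31423423.jpg' into 'ProviderA-1-1.jpg'."""
    folder, filename = key.split("/", 1)
    extension = posixpath.splitext(filename)[1]
    return folder + extension


def copy_flattened(source_client, dest_client, source_bucket, dest_bucket):
    """List the source bucket page by page and copy each object under its new key."""
    paginator = source_client.get_paginator("list_objects_v2")
    count = 0
    for page in paginator.paginate(Bucket=source_bucket):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if "/" not in key or key.endswith("/"):
                continue  # not inside a folder, or a folder placeholder
            # The copy request goes to the destination region; S3 pulls the
            # object from the source bucket without a local download.
            dest_client.copy_object(
                Bucket=dest_bucket,
                Key=flattened_key(key),
                CopySource={"Bucket": source_bucket, "Key": key},
            )
            count += 1
    return count
```

If a folder holds several versions, they all map to the same destination key, so whichever is copied last wins; that matches the stated requirement of not caring about the version.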
The process could be sped up by using multi-threading but the logic gets a bit hard. It might be easier to simply run a few copies of the program at the same time, each handling a different Prefix range (effectively, folder names).
It's a one-off job, so optimization isn't important.
If you are adding more files in future, the best method would be to create an AWS Lambda function that is triggered whenever a new file is created in S3. The Lambda function would then copy the file to the destination, then exit.
Assuming you have no location constraints set up for your buckets, flattening would simply be:
aws s3 cp --recursive s3://source_bucket/foo/ s3://target_bucket/
This assumes you have the CLI installed and the required credentials set up correctly. Or you can pass them on the command line:
aws --profile profile_A2B --region XXX s3 cp --recursive s3://source_bucket/foo/ s3://target_bucket/ --acl yyy
You don't mention any performance requirements. There are many ways of making the transfer faster, and which one helps depends on many factors. A few blind hints I can give are:
See if transfer acceleration can help you.
In general S3 to S3 transfer is faster than S3 to/from non-S3 location.
See if you can create parallel batches by prefix like:
for prefix in {a..z}
do
  aws s3 cp --recursive s3://source_bucket/foo/${prefix}* s3://target_bucket/ &
done
If this is not a one time transfer and the transfer acceleration isn't cutting it for you, consider:
download from S3 (in region A) to a local HDD residing in region A.
transfer from local HDD in region A to a local HDD in region B using other methods like Aspera or FileCatalyst or whatever else you can find.
upload from local HDD in region B to S3 (in region B).
I have no practical data to share except that Aspera blows things like FTP out of the water; it's not even a competition. YMMV.
John already covered the pseudo code. I'll just make one change to it. Write two separate programs, one to fetch the list of filenames and second to copy. It takes a lot of time to list files if you have millions of them.
Once you've listed the file names in a file, say one per line, it would be pretty easy to parallelize given you can split the file (say split -l 1000 file_list splits).
Use xargs -P or GNU parallel to run multiple aws s3 cp commands at once, if you're using shell instead of .NET.
Finally, don't forget to set the ACL (and other attributes like TTL etc.) on target files during the copy. Doing that after the copy will take a long time.
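The list-first, copy-second split described above can also be sketched in Python instead of shell. The example below only shows the chunking and the parallel dispatch; copy_one is a placeholder for whatever performs a single copy, and the chunk size and worker count are arbitrary assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Yield consecutive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def copy_in_parallel(keys, copy_one, chunk_size=1000, workers=8):
    """Run copy_one(key) for every key, one chunk per worker task."""
    def run_chunk(chunk):
        return [copy_one(key) for key in chunk]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns the chunk results in the order the chunks were submitted.
        results = pool.map(run_chunk, chunked(keys, chunk_size))
        return [result for chunk in results for result in chunk]
```

The key list would come from the file produced by the listing program, for example by reading it one line per key.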

Sync command for OpenStack Object Storage (like S3 Sync)?

Using the S3 CLI, I can sync a local directory with an S3 bucket using the following command:
aws s3 sync s3://mybucket/ ./local_dir/
This command is a complete sync. It uploads new files, updates changed files, and deletes removed files. I am trying to figure out how to do something equivalent using the OpenStack Object Storage CLI:
http://docs.openstack.org/cli-reference/content/swiftclient_commands.html
The upload command has a --changed option. But I need a complete sync that is also capable of deleting local files that were removed.
Does anyone know if I can do something equivalent to s3 sync?
The link you mentioned has this:
objects – A list of file/directory names (strings) or SwiftUploadObject instances containing a source for the created object, an object name, and an options dict (can be None) to override the options for that individual upload operation
I'm thinking, if you pass the directory and the --changed option it should work.
I don't have a Swift installation to test with. Can you try again?
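Since the question also asks for deletions, which the --changed option on upload does not cover, a complete sync needs a comparison step of its own. Below is a minimal Python sketch of just that comparison: it takes two listings as name-to-checksum dictionaries and returns what to copy and what to delete. How the listings are obtained is left out, and the function name and sample data are assumptions for illustration.

```python
def plan_sync(source, target):
    """Compare two {name: checksum} listings and return (to_copy, to_delete)."""
    # Copy anything that is missing from the target or whose checksum differs.
    to_copy = sorted(
        name for name, checksum in source.items()
        if target.get(name) != checksum
    )
    # Delete anything that exists in the target but no longer in the source.
    to_delete = sorted(name for name in target if name not in source)
    return to_copy, to_delete


# Hypothetical listings, for illustration only.
source = {"a.txt": "111", "b.txt": "222", "new.txt": "444"}
target = {"a.txt": "111", "b.txt": "999", "old.txt": "333"}
print(plan_sync(source, target))
```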