Rename and Move S3 files based on their folder name in pyspark - amazon-web-services

I am writing some dataframes to S3 using partitionBy. The folder structure that gets created is as below.
root/
date=2018-01-01/
date=2018-01-02/
I want to move these files to another directory in s3 and rename the folders as
root1/
20180101/
20180102/
Is there a way that I can achieve this from pyspark?
Also, I need the files inside the directories to be renamed sequentially, e.g.:
root1/
20180101/FILE_1.csv
20180101/FILE_2.csv

You can't directly rename S3 objects.
So one way to achieve this is to copy the objects under the desired names and then delete the original objects.
Also, S3 buckets do not have a directory structure; the "directory structure" is just prefixes in the objects' keys.
You have two options: either call the AWS CLI from Python using subprocess, or use the boto3 library to copy all the files from one "directory" to another.
Solution using subprocess:
import subprocess
subprocess.check_call("aws s3 sync s3://bucket/root/date=2018-01-01/ s3://bucket/root1/20180101/".split())
The sync command copies recursively. Then you can remove the originals with aws s3 rm --recursive "somepath", calling it through subprocess again.
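If you would rather do the copy from Python with boto3 (the second option above), here is a minimal sketch of the copy-then-delete approach; the bucket name, prefixes, and the sequential FILE_n.csv naming are placeholders taken from the question, not a tested solution:
import boto3

s3 = boto3.resource("s3")
bucket = s3.Bucket("your-bucket")  # placeholder bucket name

src_prefix = "root/date=2018-01-01/"
dst_prefix = "root1/20180101/"

# Copy each object under a new, sequentially numbered key, then delete the original.
# Objects are listed in lexicographic key order, so the numbering follows that order.
for i, obj in enumerate(bucket.objects.filter(Prefix=src_prefix), start=1):
    new_key = "{}FILE_{}.csv".format(dst_prefix, i)
    bucket.copy({"Bucket": bucket.name, "Key": obj.key}, new_key)
    obj.delete()
Like the sync approach, this is one copy plus one delete per object, since S3 has no server-side rename.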

Related

How can I retrieve a folder from S3 into an AWS SageMaker notebook

I have a folder with several files corresponding to checkpoints of an RL model trained using RLLIB. I want to analyze the checkpoints, which requires passing a certain folder as an argument, e.g., analysis_function(folder_path). I have to run this on a SageMaker notebook. I have seen that there are some questions on SO about how to retrieve files from S3, such as this one. However, how can I retrieve a whole folder?
To read the whole folder, you will just have to list all the files under the prefix and loop through them. You could either do something like this:
import boto3

s3_res = boto3.resource("s3")
my_bucket = s3_res.Bucket("<your-bucket-name>")

for obj in my_bucket.objects.filter(Prefix="<your-prefix>"):
    # your code goes here; obj.key gives the object's key
    pass
Or, simply download the files to your local storage and loop over them as you see fit (copy reference):
!aws s3 cp s3://bucket/prefix/ . --recursive
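If you need the checkpoint files on the notebook's local disk so you can hand a directory path to your analysis_function, a hedged boto3 sketch (bucket name, prefix, and the local checkpoints directory are placeholders) could look like this:
import os
import boto3

s3_res = boto3.resource("s3")
my_bucket = s3_res.Bucket("<your-bucket-name>")

local_dir = "checkpoints"  # placeholder local target directory
os.makedirs(local_dir, exist_ok=True)

# Download every object under the prefix, keeping only its file name locally.
for obj in my_bucket.objects.filter(Prefix="<your-prefix>"):
    if obj.key.endswith("/"):
        continue  # skip zero-length "folder" placeholder objects
    my_bucket.download_file(obj.key, os.path.join(local_dir, os.path.basename(obj.key)))

analysis_function(local_dir)  # the entry point from the question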

AWS S3 - Use powershell to delete all files but keep the folders

I have a PowerShell script that downloads all files from an S3 bucket and then removes the files from the bucket. All the files I'm removing are stored in a subfolder in the S3 bucket, and I just want to delete the files but maintain the subfolders.
I'm currently using the following command to delete the files in S3 once the file has been downloaded from S3.
Remove-S3Object -BucketName $S3Bucket -Key $key -Force
My problem is that when it removes all the files in the subfolder, the subfolder is removed as well. Is there a way to remove the file but keep the subfolder present using PowerShell? I believe I can do something like this,
aws s3 rm s3://<key_to_be_removed> --exclude "<subfolder_key>"
but not quite sure if that'll work.
I'm looking for the best way to accomplish this, and at the moment my only option is to recreate the subfolder via the script if the subfolder no longer exists.
The only way to accomplish having an empty folder is to create a zero-length object which has the same name as the folder you want to keep. This is actually how the S3 console enables you to create an empty folder.
You can check this by running $ aws s3 ls s3://your-bucket/folderfoo/ and observing an output object having length of zero bytes.
See more on this topic here.
As already commented, S3 does not really have folders the way file systems do. The folders presented by most S3 browsers are just generated from the paths of the objects. If you upload an object named folder/file, the browsers will present folder as a folder containing a file named file. But technically, all that exists is the single object folder/file; the folder does not exist on its own.
You can explicitly create a folder by creating a zero-length object whose key ends with a slash, e.g. folder/. If you do that, it will appear that the folder exists even if there are no files in it. But if you do not do that, the virtual folder disappears once you remove all objects in the folder.
Now the question is whether your command also removes that zero-length object representing the folder or not; I cannot tell.
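If it turns out the marker does get deleted, recreating it from a script is a one-liner. A minimal boto3 sketch (the bucket and subfolder names are hypothetical; the AWS PowerShell tools can do the same, this just shows the idea):
import boto3

s3 = boto3.client("s3")

# Recreate the zero-length "folder" marker object so the subfolder stays visible in the console.
s3.put_object(Bucket="your-bucket", Key="subfolder/", Body=b"")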

Rename parent folder in S3

We know that in S3 objects, the folder path is just a part of the object key itself.
Having an object structure similar to the following:
/files/user/09874/01/
/files/user/09875/01/
/files/user/09875/02/
/files/user/09876/01/
/files/user/09876/02/
What kind of operation would you recommend to rename the parent /files/ to /something/, having in mind that there are thousands of files and that the number of requests should be the cheapest/minimum?
(with the following docs under consideration)
As you said, the prefix is part of the object key. Unfortunately, the only way that I can think of is to iterate over the list of objects and move each one to the new prefix.
In the CLI:
aws s3 mv --recursive s3://bucket/files s3://bucket/something/
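The same iteration can be done from Python with boto3. A minimal sketch, assuming a hypothetical bucket name and the /files/ to /something/ rename from the question:
import boto3

s3 = boto3.resource("s3")
bucket = s3.Bucket("your-bucket")  # placeholder bucket name

# Copy every object from the old prefix to the new one, then delete the original.
# Each object costs one COPY and one DELETE request; there is no cheaper server-side rename.
for obj in bucket.objects.filter(Prefix="files/"):
    new_key = "something/" + obj.key[len("files/"):]
    bucket.copy({"Bucket": bucket.name, "Key": obj.key}, new_key)
    obj.delete()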

How to copy only files from many subdirectories under a directory to another project's bucket in GCP?

I have a huge amount of data in my Google Cloud Storage bucket. I have to copy all the files to a bucket in another project. The main problem is that in this bucket I created some folders, each folder has many sub-folders, and all the sub-folders contain data. So when I use the normal gsutil copy command, it copies all the data along with the folders.
I need help resolving this problem, because it is taking too much time to copy from one project's bucket to the other.
You can use this command to have all the files in the root path.
gsutil cp 'gs://[YOUR_FIRST_BUCKET_NAME]/*' gs://[YOUR_SECOND_BUCKET_NAME]
If you have nested directories inside your bucket, use this command:
gsutil cp -r 'gs://[YOUR_FIRST_BUCKET_NAME]/*' gs://[YOUR_SECOND_BUCKET_NAME]
Pay attention to the single quotes around the first command.
You can take a look at the Wildcard Names if you need more advanced features.
You can use Google Data Transfer Service
It is the second option in the Google Cloud Storage subcategory.
Use the gsutil cp command without the -r option.
The -R and -r options are synonymous. Causes directories, buckets, and bucket subdirectories to be copied recursively. If you neglect to use this option for an upload, gsutil will copy any files it finds and skip any directories. Similarly, neglecting to specify this option for a download will cause gsutil to copy any objects at the current bucket directory level, and skip any subdirectories.
If I understand correctly, you want to copy all the files from one bucket to another, but you don't want to keep the same hierarchy; instead, you want to have all the files in the root path.
Currently there is no way to do that with gsutil alone, but you can do it with a script; here is my solution:
from google.cloud import storage

bucketOrigin = storage.Client().get_bucket("<BUCKET_ID_ORIGIN>")
bucketDestination = storage.Client().get_bucket("<BUCKET_ID_DESTINATION>")

# Download each object and re-upload it to the destination bucket,
# keeping only the part of the name after the last "/" (flattening the hierarchy).
for blob in bucketOrigin.list_blobs():
    strfile = blob.download_as_string()
    blobDest = bucketDestination.blob(blob.name[blob.name.rfind("/") + 1:])
    blobDest.upload_from_string(strfile)
As mentioned by Akash Dathan, you can use the Cloud Storage Transfer Service to move your bucket content. I recommend you take a look at this Moving and Renaming Buckets guide, where you can find the steps required to perform this task.
Bear in mind the following requirements:
The Transfer Service's service account must have permission to read from your source and write to your destination. If you're deleting the source files, the Transfer Service's service account will need delete access to the source. If your service account doesn't have these permissions yet, a bucket owner must grant them.
Note: if you have the 'storage.buckets.setIamPolicy' permission for the source and destination buckets, creating a transfer job will grant that service account the required source and destination permissions to complete the transfer.
You can list all the files from your subfolders and get each file name with the split() method, then copy the file to the other bucket with the copy() method. The snippet below drops all subfolder prefixes:
const [files] = await storage.bucket(srcBucketName).getFiles();

files.forEach((file) => {
  // Keep only the part of the object name after the last "/".
  const fileName = file.name.split("/").pop();
  if (fileName)
    file.copy(storage.bucket(destBucketName).file(`${prefix}/${fileName}`));
});
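If you prefer to stay in Python, the same flattening copy can be done server-side with copy_blob, which avoids downloading and re-uploading the data; a minimal sketch with placeholder bucket names:
from google.cloud import storage

client = storage.Client()
src_bucket = client.get_bucket("<SRC_BUCKET>")
dst_bucket = client.get_bucket("<DST_BUCKET>")

# Server-side copy of each object, keeping only the file name (no subfolder prefixes).
for blob in src_bucket.list_blobs():
    file_name = blob.name.split("/")[-1]
    if file_name:  # skip "folder" placeholder objects whose names end in "/"
        src_bucket.copy_blob(blob, dst_bucket, new_name=file_name)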

Sync command for OpenStack Object Storage (like S3 Sync)?

Using the S3 CLI, I can sync a local directory with an S3 bucket using the following command:
aws s3 sync s3://mybucket/ ./local_dir/
This command is a complete sync. It uploads new files, updates changed files, and deletes removed files. I am trying to figure out how to do something equivalent using the OpenStack Object Storage CLI:
http://docs.openstack.org/cli-reference/content/swiftclient_commands.html
The upload command has a --changed option. But I need a complete sync that is also capable of deleting local files that were removed.
Does anyone know if I can do something equivalent to s3 sync?
The link you mentioned has this:
objects – A list of file/directory names (strings) or SwiftUploadObject instances containing a source for the created object, an object name, and an options dict (can be None) to override the options for that individual upload operation
I'm thinking that if you pass the directory together with the --changed option, it should work.
I don't have a Swift deployment to test with. Can you try it?
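For reference, a rough, untested sketch of what that might look like with python-swiftclient's SwiftService API, based on the parameter description quoted above (the container name and local directory are placeholders):
from swiftclient.service import SwiftService

# Upload a local directory, skipping objects whose content has not changed.
with SwiftService() as swift:
    for result in swift.upload("my-container", ["./local_dir"], options={"changed": True}):
        if not result["success"]:
            print("failed:", result.get("object"), result.get("error"))
Note that this only covers the --changed upload behaviour mentioned above; it does not delete objects that were removed, which is the part of a full sync the question is still missing.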